{"id": "http://arxiv.org/abs/2202.01208", "title": "Deep Learning for Ultrasound Speed-of-Sound Reconstruction: Impacts of Training Data Diversity on Stability and Robustness.", "authors": "Farnaz Khun Jush, Markus Biele, Peter M. Dueppenbecker, Andreas Maier", "abstract": "Ultrasound b-mode imaging is a qualitative approach and diagnostic quality strongly depends on operators' training and experience. Quantitative approaches can provide information about tissue properties; therefore, can be used for identifying various tissue types, e.g., speed-of-sound in the tissue can be used as a biomarker for tissue malignancy, especially in breast imaging. Recent studies showed the possibility of speed-of-sound reconstruction using deep neural networks that are fully trained on simulated data. However, because of the ever present domain shift between simulated and measured data, the stability and performance of these models in real setups are still under debate. In this study, we investigated the impacts of training data diversity on the robustness of these networks by using multiple kinds of geometrical and natural simulated phantom structures. On the simulated data, we investigated the performance of the networks on out-of-domain echogenicity, geometries, and in the presence of noise. We further inspected the stability of employing such tissue modeling in a real data acquisition setup. We demonstrated that training the network with a joint set of datasets including both geometrical and natural tissue models improves the stability of the predicted speed-of-sound values both on simulated and measured data.", "sentences": ["Deep Learning for Ultrasound Speed-of-Sound Reconstruction: Impacts of Training Data Diversity on Stability and Robustness.", "Ultrasound b-mode imaging is a qualitative approach and diagnostic quality strongly depends on operators' training and experience.", "Quantitative approaches can provide information about tissue properties; therefore, can be used for identifying various tissue types, e.g., speed-of-sound in the tissue can be used as a biomarker for tissue malignancy, especially in breast imaging.", "Recent studies showed the possibility of speed-of-sound reconstruction using deep neural networks that are fully trained on simulated data.", "However, because of the ever present domain shift between simulated and measured data, the stability and performance of these models in real setups are still under debate.", "In this study, we investigated the impacts of training data diversity on the robustness of these networks by using multiple kinds of geometrical and natural simulated phantom structures.", "On the simulated data, we investigated the performance of the networks on out-of-domain echogenicity, geometries, and in the presence of noise.", "We further inspected the stability of employing such tissue modeling in a real data acquisition setup.", "We demonstrated that training the network with a joint set of datasets including both geometrical and natural tissue models improves the stability of the predicted speed-of-sound values both on simulated and measured data."]} {"id": "http://arxiv.org/abs/2202.01214", "title": "Approximate Bisimulation Relations for Neural Networks and Application to Assured Neural Network Compression.", "authors": "Weiming Xiang, Zhongzhu Shao", "abstract": "In this paper, we propose a concept of approximate bisimulation relation for feedforward neural networks. 
In the framework of approximate bisimulation relation, a novel neural network merging method is developed to compute the approximate bisimulation error between two neural networks based on reachability analysis of neural networks. The developed method is able to quantitatively measure the distance between the outputs of two neural networks with the same inputs. Then, we apply the approximate bisimulation relation results to perform neural network model reduction and compute the compression precision, i.e., assured neural network compression. Finally, using the assured neural network compression, we accelerate the verification processes of ACAS Xu neural networks to illustrate the effectiveness and advantages of our proposed approximate bisimulation approach.", "sentences": ["Approximate Bisimulation Relations for Neural Networks and Application to Assured Neural Network Compression.", "In this paper, we propose a concept of approximate bisimulation relation for feedforward neural networks.", "In the framework of approximate bisimulation relation, a novel neural network merging method is developed to compute the approximate bisimulation error between two neural networks based on reachability analysis of neural networks.", "The developed method is able to quantitatively measure the distance between the outputs of two neural networks with the same inputs.", "Then, we apply the approximate bisimulation relation results to perform neural network model reduction and compute the compression precision, i.e., assured neural network compression.", "Finally, using the assured neural network compression, we accelerate the verification processes of ACAS Xu neural networks to illustrate the effectiveness and advantages of our proposed approximate bisimulation approach."]} {"id": "http://arxiv.org/abs/2202.01246", "title": "PolarDenseNet: A Deep Learning Model for CSI Feedback in MIMO Systems.", "authors": "Pranav Madadi, Jeongho Jeon, Joonyoung Cho, Caleb Lo, Juho Lee, Jianzhong Zhang", "abstract": "In multiple-input multiple-output (MIMO) systems, high-resolution channel state information (CSI) is required at the base station (BS) to ensure optimal performance, especially in the case of multi-user MIMO (MU-MIMO) systems. In the absence of channel reciprocity in frequency division duplex (FDD) systems, the user needs to send the CSI to the BS. Often the large overhead associated with this CSI feedback in FDD systems becomes the bottleneck in improving the system performance. In this paper, we propose an AI-based CSI feedback scheme based on an auto-encoder architecture that encodes the CSI at the UE into a low-dimensional latent space and decodes it back at the BS, thereby effectively reducing the feedback overhead while minimizing the loss during recovery. 
Our simulation results show that the proposed AI-based architecture outperforms the state-of-the-art high-resolution linear combination codebook using the DFT basis adopted in the 5G New Radio (NR) system.", "sentences": ["PolarDenseNet: A Deep Learning Model for CSI Feedback in MIMO Systems.", "In multiple-input multiple-output (MIMO) systems, high-resolution channel state information (CSI) is required at the base station (BS) to ensure optimal performance, especially in the case of multi-user MIMO (MU-MIMO) systems.", "In the absence of channel reciprocity in frequency division duplex (FDD) systems, the user needs to send the CSI to the BS.", "Often the large overhead associated with this CSI feedback in FDD systems becomes the bottleneck in improving the system performance.", "In this paper, we propose an AI-based CSI feedback scheme based on an auto-encoder architecture that encodes the CSI at the UE into a low-dimensional latent space and decodes it back at the BS, thereby effectively reducing the feedback overhead while minimizing the loss during recovery.", "Our simulation results show that the proposed AI-based architecture outperforms the state-of-the-art high-resolution linear combination codebook using the DFT basis adopted in the 5G New Radio (NR) system."]} {"id": "http://arxiv.org/abs/2202.01256", "title": "Introduction to The Dynamic Pickup and Delivery Problem Benchmark -- ICAPS 2021 Competition.", "authors": "Jianye Hao, Jiawen Lu, Xijun Li, Xialiang Tong, Xiang Xiang, Mingxuan Yuan, Hankz Hankui Zhuo", "abstract": "The Dynamic Pickup and Delivery Problem (DPDP) is an essential problem within the logistics domain. So far, research on this problem has mainly focused on using artificial data which fails to reflect the complexity of real-world problems. In this draft, we would like to introduce a new benchmark from real business scenarios as well as a simulator supporting the dynamic evaluation. The benchmark and simulator have been published and successfully supported the ICAPS 2021 Dynamic Pickup and Delivery Problem competition, in which 152 teams participated.", "sentences": ["Introduction to The Dynamic Pickup and Delivery Problem Benchmark -- ICAPS 2021 Competition.", "The Dynamic Pickup and Delivery Problem (DPDP) is an essential problem within the logistics domain.", "So far, research on this problem has mainly focused on using artificial data which fails to reflect the complexity of real-world problems.", "In this draft, we would like to introduce a new benchmark from real business scenarios as well as a simulator supporting the dynamic evaluation.", "The benchmark and simulator have been published and successfully supported the ICAPS 2021 Dynamic Pickup and Delivery Problem competition, in which 152 teams participated."]} {"id": "http://arxiv.org/abs/2202.01258", "title": "Accelerated Quality-Diversity for Robotics through Massive Parallelism.", "authors": "Bryan Lim, Maxime Allard, Luca Grillotti, Antoine Cully", "abstract": "Quality-Diversity (QD) algorithms are a well-known approach to generate large collections of diverse and high-quality policies. However, QD algorithms are also known to be data-inefficient, requiring large amounts of computational resources, and slow when used in practice for robotics tasks. Policy evaluations are already commonly performed in parallel to speed up QD algorithms but have limited capabilities on a single machine as most physics simulators run on CPUs. 
With recent advances in simulators that run on accelerators, thousands of evaluations can be performed in parallel on a single GPU/TPU. In this paper, we present QDax, an implementation of MAP-Elites which leverages massive parallelism on accelerators to make QD algorithms more accessible. We first demonstrate the improvements in the number of evaluations per second that parallelism using accelerated simulators can offer. More importantly, we show that QD algorithms are ideal candidates and can scale with massive parallelism to be run at interactive timescales. The increase in parallelism does not significantly affect the performance of QD algorithms, while reducing experiment runtimes by two orders of magnitude, turning days of computation into minutes. These results show that QD can now benefit from hardware acceleration, which contributed significantly to the boom of deep learning.", "sentences": ["Accelerated Quality-Diversity for Robotics through Massive Parallelism.", "Quality-Diversity (QD) algorithms are a well-known approach to generate large collections of diverse and high-quality policies.", "However, QD algorithms are also known to be data-inefficient, requiring large amounts of computational resources, and slow when used in practice for robotics tasks.", "Policy evaluations are already commonly performed in parallel to speed up QD algorithms but have limited capabilities on a single machine as most physics simulators run on CPUs.", "With recent advances in simulators that run on accelerators, thousands of evaluations can be performed in parallel on a single GPU/TPU.", "In this paper, we present QDax, an implementation of MAP-Elites which leverages massive parallelism on accelerators to make QD algorithms more accessible.", "We first demonstrate the improvements in the number of evaluations per second that parallelism using accelerated simulators can offer.", "More importantly, we show that QD algorithms are ideal candidates and can scale with massive parallelism to be run at interactive timescales.", "The increase in parallelism does not significantly affect the performance of QD algorithms, while reducing experiment runtimes by two orders of magnitude, turning days of computation into minutes.", "These results show that QD can now benefit from hardware acceleration, which contributed significantly to the boom of deep learning."]} {"id": "http://arxiv.org/abs/2202.01281", "title": "An Experience Report of Executive-Level Artificial Intelligence Education in the United Arab Emirates.", "authors": "David Johnson, Mohammad Alsharid, Rasheed El-Bouri, Nigel Mehdi, Farah Shamout, Alexandre Szenicer, David Toman, Saqr Binghalib", "abstract": "Teaching artificial intelligence (AI) is challenging. It is a fast-moving field and therefore difficult to keep people updated with the state-of-the-art. Educational offerings for students are ever increasing, beyond university degree programs where AI education traditionally lay. In this paper, we present an experience report of teaching an AI course to business executives in the United Arab Emirates (UAE). Rather than focusing only on theoretical and technical aspects, we developed a course that teaches AI with a view to enabling students to understand how to incorporate it into existing business processes. 
We present an overview of our course, curriculum and teaching methods, and we discuss our reflections on teaching adult learners and on teaching students in the UAE.", "sentences": ["An Experience Report of Executive-Level Artificial Intelligence Education in the United Arab Emirates.", "Teaching artificial intelligence (AI) is challenging.", "It is a fast-moving field and therefore difficult to keep people updated with the state-of-the-art.", "Educational offerings for students are ever increasing, beyond university degree programs where AI education traditionally lay.", "In this paper, we present an experience report of teaching an AI course to business executives in the United Arab Emirates (UAE).", "Rather than focusing only on theoretical and technical aspects, we developed a course that teaches AI with a view to enabling students to understand how to incorporate it into existing business processes.", "We present an overview of our course, curriculum and teaching methods, and we discuss our reflections on teaching adult learners and on teaching students in the UAE."]} {"id": "http://arxiv.org/abs/2202.01291", "title": "Computer sciences and synthesis: retrospective and perspective.", "authors": "Vladislav Dorofeev, Petro Trokhimchuk", "abstract": "The problem of synthesis in computer sciences, including cybernetics, artificial intelligence and system analysis, is analyzed. The main methods for realizing this synthesis are discussed. Ways of searching for a universal method for creating a universal synthetic science are presented. Polymetric analysis is given as an example of such a universal method. Perspectives for further development of this research, including the application of the polymetric method to the main problems of computer sciences, are also analyzed.", "sentences": ["Computer sciences and synthesis: retrospective and perspective.", "The problem of synthesis in computer sciences, including cybernetics, artificial intelligence and system analysis, is analyzed.", "The main methods for realizing this synthesis are discussed.", "Ways of searching for a universal method for creating a universal synthetic science are presented.", "Polymetric analysis is given as an example of such a universal method.", "Perspectives for further development of this research, including the application of the polymetric method to the main problems of computer sciences, are also analyzed."]} {"id": "http://arxiv.org/abs/2202.01300", "title": "Causal Inference Through the Structural Causal Marginal Problem.", "authors": "Luigi Gresele, Julius von K\u00fcgelgen, Jonas M. K\u00fcbler, Elke Kirschbaum, Bernhard Sch\u00f6lkopf, Dominik Janzing", "abstract": "We introduce an approach to counterfactual inference based on merging information from multiple datasets. We consider a causal reformulation of the statistical marginal problem: given a collection of marginal structural causal models (SCMs) over distinct but overlapping sets of variables, determine the set of joint SCMs that are counterfactually consistent with the marginal ones. We formalise this approach for categorical SCMs using the response function formulation and show that it reduces the space of allowed marginal and joint SCMs. 
Our work thus highlights a new mode of falsifiability through additional variables, in contrast to the statistical one via additional data.", "sentences": ["Causal Inference Through the Structural Causal Marginal Problem.", "We introduce an approach to counterfactual inference based on merging information from multiple datasets.", "We consider a causal reformulation of the statistical marginal problem: given a collection of marginal structural causal models (SCMs) over distinct but overlapping sets of variables, determine the set of joint SCMs that are counterfactually consistent with the marginal ones.", "We formalise this approach for categorical SCMs using the response function formulation and show that it reduces the space of allowed marginal and joint SCMs.", "Our work thus highlights a new mode of falsifiability through additional variables, in contrast to the statistical one via additional data."]} {"id": "http://arxiv.org/abs/2202.01302", "title": "A Comparison of Online Hate on Reddit and 4chan: A Case Study of the 2020 US Election.", "authors": "Fatima Zahrah, Jason R. C. Nurse, Michael Goldsmith", "abstract": "The rapid integration of the Internet into our daily lives has led to many benefits but also to a number of new, widespread threats such as online hate, trolling, bullying, and generally aggressive behaviours. While research has traditionally explored online hate on one particular platform, the reality is that such hate is a phenomenon that often makes use of multiple online networks. In this article, we seek to advance the discussion on online hate by harnessing a comparative approach, where we make use of various Natural Language Processing (NLP) techniques to computationally analyse hateful content from Reddit and 4chan relating to the 2020 US Presidential Elections. Our findings show how content and posting activity can differ depending on the platform being used. Through this, we provide an initial comparison of the platform-specific behaviours of online hate, and how different platforms can serve specific purposes. 
We further provide several avenues for future research utilising a cross-platform approach so as to gain a more comprehensive understanding of the global hate ecosystem.", "sentences": ["A Comparison of Online Hate on Reddit and 4chan: A Case Study of the 2020 US Election.", "The rapid integration of the Internet into our daily lives has led to many benefits but also to a number of new, widespread threats such as online hate, trolling, bullying, and generally aggressive behaviours.", "While research has traditionally explored online hate on one particular platform, the reality is that such hate is a phenomenon that often makes use of multiple online networks.", "In this article, we seek to advance the discussion on online hate by harnessing a comparative approach, where we make use of various Natural Language Processing (NLP) techniques to computationally analyse hateful content from Reddit and 4chan relating to the 2020 US Presidential Elections.", "Our findings show how content and posting activity can differ depending on the platform being used.", "Through this, we provide an initial comparison of the platform-specific behaviours of online hate, and how different platforms can serve specific purposes.", "We further provide several avenues for future research utilising a cross-platform approach so as to gain a more comprehensive understanding of the global hate ecosystem."]} {"id": "http://arxiv.org/abs/2202.01327", "title": "Adaptive Sampling Strategies to Construct Equitable Training Datasets.", "authors": "William Cai, Ro Encarnacion, Bobbie Chern, Sam Corbett-Davies, Miranda Bogen, Stevie Bergman, Sharad Goel", "abstract": "In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem. We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups. We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific learning rates and costs per sample. This flexible approach incorporates preferences of model-builders and other stakeholders, as well as the statistical properties of the learning task. When data collection decisions are made sequentially, we show that under certain conditions this optimization problem can be efficiently solved even without prior knowledge of the learning rates. To illustrate our approach, we conduct a simulation study of polygenic risk scores on synthetic genomic data -- an application domain that often suffers from non-representative data collection. 
We find that our adaptive sampling strategy outperforms several common data collection heuristics, including equal and proportional sampling, demonstrating the value of strategic dataset design for building equitable models.", "sentences": ["Adaptive Sampling Strategies to Construct Equitable Training Datasets.", "In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups.", "One factor contributing to these performance gaps is a lack of representation in the data the models are trained on.", "It is often unclear, however, how to operationalize representativeness in specific applications.", "Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem.", "We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups.", "We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific learning rates and costs per sample.", "This flexible approach incorporates preferences of model-builders and other stakeholders, as well as the statistical properties of the learning task.", "When data collection decisions are made sequentially, we show that under certain conditions this optimization problem can be efficiently solved even without prior knowledge of the learning rates.", "To illustrate our approach, we conduct a simulation study of polygenic risk scores on synthetic genomic data -- an application domain that often suffers from non-representative data collection.", "We find that our adaptive sampling strategy outperforms several common data collection heuristics, including equal and proportional sampling, demonstrating the value of strategic dataset design for building equitable models."]} {"id": "http://arxiv.org/abs/2202.01334", "title": "Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization.", "authors": "Dianbo Liu, Alex Lamb, Xu Ji, Pascal Notsawo, Mike Mozer, Yoshua Bengio, Kenji Kawaguchi", "abstract": "Vector Quantization (VQ) is a method for discretizing latent representations and has become a major part of the deep learning toolkit. It has been theoretically and empirically shown that discretization of representations leads to improved generalization, including in reinforcement learning where discretization can be used to bottleneck multi-agent communication to promote agent specialization and robustness. The discretization tightness of most VQ-based methods is defined by the number of discrete codes in the representation vector and the codebook size, which are fixed as hyperparameters. In this work, we propose learning to dynamically select discretization tightness conditioned on inputs, based on the hypothesis that data naturally contains variations in complexity that call for different levels of representational coarseness. 
We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks.", "sentences": ["Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization.", "Vector Quantization (VQ) is a method for discretizing latent representations and has become a major part of the deep learning toolkit.", "It has been theoretically and empirically shown that discretization of representations leads to improved generalization, including in reinforcement learning where discretization can be used to bottleneck multi-agent communication to promote agent specialization and robustness.", "The discretization tightness of most VQ-based methods is defined by the number of discrete codes in the representation vector and the codebook size, which are fixed as hyperparameters.", "In this work, we propose learning to dynamically select discretization tightness conditioned on inputs, based on the hypothesis that data naturally contains variations in complexity that call for different levels of representational coarseness.", "We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks."]} {"id": "http://arxiv.org/abs/2202.01338", "title": "Regression Transformer: Concurrent Conditional Generation and Regression by Blending Numerical and Textual Tokens.", "authors": "Jannis Born, Matteo Manica", "abstract": "We report the Regression Transformer (RT), a method that abstracts regression as a conditional sequence modeling problem. The RT casts continuous properties as sequences of numerical tokens and encodes them jointly with conventional tokens. This yields a dichotomous model that can seamlessly transition between solving regression tasks and conditional generation tasks, solely governed by the mask location. We propose several extensions to the XLNet objective and adopt an alternating training scheme to concurrently optimize property prediction and conditional text generation based on a self-consistency loss. Our experiments on both chemical and protein languages demonstrate that the performance of traditional regression models can be surpassed despite training with cross entropy loss. Importantly, priming the same model with continuous properties yields a highly competitive conditional generative model that outperforms specialized approaches in a constrained property optimization benchmark. In sum, the Regression Transformer opens the door for \"Swiss army knife\" models that excel at both regression and conditional generation. 
This finds application particularly in property-driven, local exploration of the chemical or protein space.", "sentences": ["Regression Transformer: Concurrent Conditional Generation and Regression by Blending Numerical and Textual Tokens.", "We report the Regression Transformer (RT), a method that abstracts regression as a conditional sequence modeling problem.", "The RT casts continuous properties as sequences of numerical tokens and encodes them jointly with conventional tokens.", "This yields a dichotomous model that can seamlessly transition between solving regression tasks and conditional generation tasks, solely governed by the mask location.", "We propose several extensions to the XLNet objective and adopt an alternating training scheme to concurrently optimize property prediction and conditional text generation based on a self-consistency loss.", "Our experiments on both chemical and protein languages demonstrate that the performance of traditional regression models can be surpassed despite training with cross entropy loss.", "Importantly, priming the same model with continuous properties yields a highly competitive conditional generative model that outperforms specialized approaches in a constrained property optimization benchmark.", "In sum, the Regression Transformer opens the door for \"Swiss army knife\" models that excel at both regression and conditional generation.", "This finds application particularly in property-driven, local exploration of the chemical or protein space."]} {"id": "http://arxiv.org/abs/2202.01344", "title": "Formal Mathematics Statement Curriculum Learning.", "authors": "Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, Ilya Sutskever", "abstract": "We explore the use of expert iteration in the context of language modeling applied to formal mathematics. We show that at the same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only. We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs. 
Finally, by applying this expert iteration to a manually curated set of problem statements, we achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.", "sentences": ["Formal Mathematics Statement Curriculum Learning.", "We explore the use of expert iteration in the context of language modeling applied to formal mathematics.", "We show that at the same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only.", "We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs.", "Finally, by applying this expert iteration to a manually curated set of problem statements, we achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads."]} {"id": "http://arxiv.org/abs/2202.01351", "title": "Technology Ethics in Action: Critical and Interdisciplinary Perspectives.", "authors": "Ben Green (editor)", "abstract": "This special issue interrogates the meaning and impacts of \"tech ethics\": the embedding of ethics into digital technology research, development, use, and governance. In response to concerns about the social harms associated with digital technologies, many individuals and institutions have articulated the need for a greater emphasis on ethics in digital technology. Yet as more groups embrace the concept of ethics, critical discourses have emerged questioning whose ethics are being centered, whether \"ethics\" is the appropriate frame for improving technology, and what it means to develop \"ethical\" technology in practice. This interdisciplinary issue takes up these questions, interrogating the relationships among ethics, technology, and society in action. This special issue engages with the normative and contested notions of ethics itself, how ethics has been integrated with technology across domains, and potential paths forward to support more just and egalitarian technology. 
Rather than starting from philosophical theories, the authors in this issue orient their articles around the real-world discourses and impacts of tech ethics--i.e., tech ethics in action.", "sentences": ["Technology Ethics in Action: Critical and Interdisciplinary Perspectives.", "This special issue interrogates the meaning and impacts of \"tech ethics\": the embedding of ethics into digital technology research, development, use, and governance.", "In response to concerns about the social harms associated with digital technologies, many individuals and institutions have articulated the need for a greater emphasis on ethics in digital technology.", "Yet as more groups embrace the concept of ethics, critical discourses have emerged questioning whose ethics are being centered, whether \"ethics\" is the appropriate frame for improving technology, and what it means to develop \"ethical\" technology in practice.", "This interdisciplinary issue takes up these questions, interrogating the relationships among ethics, technology, and society in action.", "This special issue engages with the normative and contested notions of ethics itself, how ethics has been integrated with technology across domains, and potential paths forward to support more just and egalitarian technology.", "Rather than starting from philosophical theories, the authors in this issue orient their articles around the real-world discourses and impacts of tech ethics--i.e., tech ethics in action."]} {"id": "http://arxiv.org/abs/2202.01356", "title": "Direct Molecular Conformation Generation.", "authors": "Jinhua Zhu, Yingce Xia, Chang Liu, Lijun Wu, Shufang Xie, Tong Wang, Yusong Wang, Wengang Zhou, Tao Qin, Houqiang Li, Tie-Yan Liu", "abstract": "Molecular conformation generation aims to generate three-dimensional coordinates of all the atoms in a molecule and is an important task in bioinformatics and pharmacology. Previous distance-based methods first predict interatomic distances and then generate conformations based on them, which could result in conflicting distances. In this work, we propose a method that directly predicts the coordinates of atoms. We design a dedicated loss function for conformation generation, which is invariant to roto-translation of coordinates of conformations and permutation of symmetric atoms in molecules. We further design a backbone model that stacks multiple blocks, where each block refines the conformation generated by its preceding block. Our method achieves state-of-the-art results on four public benchmarks: on small-scale GEOM-QM9 and GEOM-Drugs which have $200$K training data, we can improve the previous best matching score by $3.5\\%$ and $28.9\\%$; on large-scale GEOM-QM9 and GEOM-Drugs which have millions of training data, those two improvements are $47.1\\%$ and $36.3\\%$. This shows the effectiveness of our method and the great potential of the direct approach. 
Our code is released at \\url{https://github.com/DirectMolecularConfGen/DMCG}.", "sentences": ["Direct Molecular Conformation Generation.", "Molecular conformation generation aims to generate three-dimensional coordinates of all the atoms in a molecule and is an important task in bioinformatics and pharmacology.", "Previous distance-based methods first predict interatomic distances and then generate conformations based on them, which could result in conflicting distances.", "In this work, we propose a method that directly predicts the coordinates of atoms.", "We design a dedicated loss function for conformation generation, which is invariant to roto-translation of coordinates of conformations and permutation of symmetric atoms in molecules.", "We further design a backbone model that stacks multiple blocks, where each block refines the conformation generated by its preceding block.", "Our method achieves state-of-the-art results on four public benchmarks: on small-scale GEOM-QM9 and GEOM-Drugs which have $200$K training data, we can improve the previous best matching score by $3.5\\%$ and $28.9\\%$; on large-scale GEOM-QM9 and GEOM-Drugs which have millions of training data, those two improvements are $47.1\\%$ and $36.3\\%$.", "This shows the effectiveness of our method and the great potential of the direct approach.", "Our code is released at \\url{https://github.com/DirectMolecularConfGen/DMCG}."]} {"id": "http://arxiv.org/abs/2202.01459", "title": "Concept Bottleneck Model with Additional Unsupervised Concepts.", "authors": "Yoshihide Sawada, Keigo Nakamura", "abstract": "With the increasing demands for accountability, interpretability is becoming an essential capability for real-world AI applications. However, most methods utilize post-hoc approaches rather than training an interpretable model. In this article, we propose a novel interpretable model based on the concept bottleneck model (CBM). CBM uses concept labels to train an intermediate layer as an additional visible layer. However, because the number of concept labels restricts the dimension of this layer, it is difficult to obtain high accuracy with a small number of labels. To address this issue, we integrate supervised concepts with unsupervised ones trained with self-explaining neural networks (SENNs). By seamlessly training these two types of concepts while reducing the amount of computation, we can obtain both supervised and unsupervised concepts simultaneously, even for large-sized images. We refer to the proposed model as the concept bottleneck model with additional unsupervised concepts (CBM-AUC). We experimentally confirmed that the proposed model outperformed CBM and SENN. 
We also visualized the saliency map of each concept and confirmed that it was consistent with the semantic meanings.", "sentences": ["Concept Bottleneck Model with Additional Unsupervised Concepts.", "With the increasing demands for accountability, interpretability is becoming an essential capability for real-world AI applications.", "However, most methods utilize post-hoc approaches rather than training an interpretable model.", "In this article, we propose a novel interpretable model based on the concept bottleneck model (CBM).", "CBM uses concept labels to train an intermediate layer as an additional visible layer.", "However, because the number of concept labels restricts the dimension of this layer, it is difficult to obtain high accuracy with a small number of labels.", "To address this issue, we integrate supervised concepts with unsupervised ones trained with self-explaining neural networks (SENNs).", "By seamlessly training these two types of concepts while reducing the amount of computation, we can obtain both supervised and unsupervised concepts simultaneously, even for large-sized images.", "We refer to the proposed model as the concept bottleneck model with additional unsupervised concepts (CBM-AUC).", "We experimentally confirmed that the proposed model outperformed CBM and SENN.", "We also visualized the saliency map of each concept and confirmed that it was consistent with the semantic meanings."]} {"id": "http://arxiv.org/abs/2202.01461", "title": "ExPoSe: Combining State-Based Exploration with Gradient-Based Online Search.", "authors": "Dixant Mittal, Siddharth Arvindan, Wee Sun Lee", "abstract": "A tree-based online search algorithm iteratively simulates trajectories and updates Q-value information on a set of states represented by a tree structure. Alternatively, policy-gradient-based online search algorithms update the information obtained from simulated trajectories directly onto the parameters of the policy and have been found to be effective. While tree-based methods limit the updates from simulations to the states that exist in the tree and do not interpolate the information to nearby states, policy gradient search methods do not do explicit exploration. In this paper, we show that it is possible to combine and leverage the strengths of these two methods for improved search performance. We examine the key reasons behind the improvement and propose a simple yet effective online search method, named Exploratory Policy Gradient Search (ExPoSe), that updates both the parameters of the policy as well as search information on the states in the trajectory. 
We conduct experiments on complex planning problems, which include Sokoban and Hamiltonian cycle search in sparse graphs, and show that combining exploration with policy gradient improves online search performance.", "sentences": ["ExPoSe: Combining State-Based Exploration with Gradient-Based Online Search.", "A tree-based online search algorithm iteratively simulates trajectories and updates Q-value information on a set of states represented by a tree structure.", "Alternatively, policy-gradient-based online search algorithms update the information obtained from simulated trajectories directly onto the parameters of the policy and have been found to be effective.", "While tree-based methods limit the updates from simulations to the states that exist in the tree and do not interpolate the information to nearby states, policy gradient search methods do not do explicit exploration.", "In this paper, we show that it is possible to combine and leverage the strengths of these two methods for improved search performance.", "We examine the key reasons behind the improvement and propose a simple yet effective online search method, named Exploratory Policy Gradient Search (ExPoSe), that updates both the parameters of the policy as well as search information on the states in the trajectory.", "We conduct experiments on complex planning problems, which include Sokoban and Hamiltonian cycle search in sparse graphs, and show that combining exploration with policy gradient improves online search performance."]} {"id": "http://arxiv.org/abs/2202.01473", "title": "A multi-domain virtual network embedding algorithm with delay prediction.", "authors": "Peiying Zhang, Xue Pang, Yongjing Ni, Haipeng Yao, Xin Li", "abstract": "Virtual network embedding (VNE) is a crucial part of network virtualization (NV), which aims to map the virtual networks (VNs) to a shared substrate network (SN). With the emergence of various delay-sensitive applications, how to improve the delay performance of the system has become a hot topic in academic circles. Based on extensive research, we propose a multi-domain virtual network embedding algorithm based on delay prediction (DP-VNE). Firstly, the candidate physical nodes are selected by estimating the delay of virtual requests; then a particle swarm optimization (PSO) algorithm is used to optimize the mapping process, so as to reduce the delay of the system. 
The simulation results show that compared with the other three advanced algorithms, the proposed algorithm can significantly reduce the system delay while keeping other indicators unaffected.", "sentences": ["A multi-domain virtual network embedding algorithm with delay prediction.", "Virtual network embedding (VNE) is a crucial part of network virtualization (NV), which aims to map the virtual networks (VNs) to a shared substrate network (SN).", "With the emergence of various delay-sensitive applications, how to improve the delay performance of the system has become a hot topic in academic circles.", "Based on extensive research, we propose a multi-domain virtual network embedding algorithm based on delay prediction (DP-VNE).", "Firstly, the candidate physical nodes are selected by estimating the delay of virtual requests; then a particle swarm optimization (PSO) algorithm is used to optimize the mapping process, so as to reduce the delay of the system.", "The simulation results show that compared with the other three advanced algorithms, the proposed algorithm can significantly reduce the system delay while keeping other indicators unaffected."]} {"id": "http://arxiv.org/abs/2202.01494", "title": "PARCEL: Physics-based unsupervised contrastive representation learning for parallel MR imaging.", "authors": "Shanshan Wang, Ruoyou Wu, Cheng Li, Juan Zou, Hairong Zheng", "abstract": "With the successful application of deep learning in magnetic resonance imaging, parallel imaging techniques based on neural networks have attracted wide attention. However, without high-quality fully sampled datasets for training, the performance of these methods tends to be limited. To address this issue, this paper proposes a physics-based unsupervised contrastive representation learning (PARCEL) method to speed up parallel MR imaging. Specifically, PARCEL has three key ingredients to achieve direct deep learning from the undersampled k-space data. Namely, a parallel framework has been developed by learning two branches of model-based networks unrolled with the conjugate gradient algorithm; augmented undersampled k-space data randomly drawn from the obtained k-space data are used to help the parallel network to capture the detailed information. A specially designed co-training loss is used to guide the two networks to capture the inherent features and representations of the to-be-reconstructed MR image. 
The proposed method has been evaluated on in vivo datasets and compared to five state-of-the-art methods; the results show that PARCEL is able to learn useful representations for more accurate MR reconstructions without reliance on fully-sampled datasets.", "sentences": ["PARCEL: Physics-based unsupervised contrastive representation learning for parallel MR imaging.", "With the successful application of deep learning in magnetic resonance imaging, parallel imaging techniques based on neural networks have attracted wide attention.", "However, without high-quality fully sampled datasets for training, the performance of these methods tends to be limited.", "To address this issue, this paper proposes a physics-based unsupervised contrastive representation learning (PARCEL) method to speed up parallel MR imaging.", "Specifically, PARCEL has three key ingredients to achieve direct deep learning from the undersampled k-space data.", "Namely, a parallel framework has been developed by learning two branches of model-based networks unrolled with the conjugate gradient algorithm; augmented undersampled k-space data randomly drawn from the obtained k-space data are used to help the parallel network to capture the detailed information.", "A specially designed co-training loss is used to guide the two networks to capture the inherent features and representations of the to-be-reconstructed MR image.", "The proposed method has been evaluated on in vivo datasets and compared to five state-of-the-art methods; the results show that PARCEL is able to learn useful representations for more accurate MR reconstructions without reliance on fully-sampled datasets."]} {"id": "http://arxiv.org/abs/2202.01512", "title": "Data Heterogeneity-Robust Federated Learning via Group Client Selection in Industrial IoT.", "authors": "Zonghang Li, Yihong He, Hongfang Yu, Jiawen Kang, Xiaoping Li, Zenglin Xu, Dusit Niyato", "abstract": "Nowadays, the industrial Internet of Things (IIoT) has played an integral role in Industry 4.0 and produced massive amounts of data for industrial intelligence. These data are located on decentralized devices in modern factories. To protect the confidentiality of industrial data, federated learning (FL) was introduced to collaboratively train shared machine learning models. However, the local data collected by different devices skew in class distribution and degrade industrial FL performance. This challenge has been widely studied at the mobile edge, but existing studies ignored the rapidly changing streaming data and the clustered nature of factory devices, and, more seriously, they may threaten data security. In this paper, we propose FedGS, which is a hierarchical cloud-edge-end FL framework for 5G-empowered industries, to improve industrial FL performance on non-i.i.d. data. Taking advantage of naturally clustered factory devices, FedGS uses a gradient-based binary permutation algorithm (GBP-CS) to select a subset of devices within each factory and build homogeneous super nodes participating in FL training. Then, we propose a compound-step synchronization protocol to coordinate the training process within and among these super nodes, which shows great robustness against data heterogeneity. The proposed methods are time-efficient and can adapt to dynamic environments, without exposing confidential industrial data to risky manipulation. We prove that FedGS has better convergence performance than FedAvg and give a relaxed condition under which FedGS is more communication-efficient. 
Extensive experiments show that FedGS improves accuracy by 3.5% and reduces training rounds by 59% on average, confirming its superior effectiveness and efficiency on non-i.i.d. data.", "sentences": ["Data Heterogeneity-Robust Federated Learning via Group Client Selection in Industrial IoT.", "Nowadays, the industrial Internet of Things (IIoT) has played an integral role in Industry 4.0 and produced massive amounts of data for industrial intelligence.", "These data are located on decentralized devices in modern factories.", "To protect the confidentiality of industrial data, federated learning (FL) was introduced to collaboratively train shared machine learning models.", "However, the local data collected by different devices skew in class distribution and degrade industrial FL performance.", "This challenge has been widely studied at the mobile edge, but existing studies ignored the rapidly changing streaming data and the clustered nature of factory devices, and, more seriously, they may threaten data security.", "In this paper, we propose FedGS, which is a hierarchical cloud-edge-end FL framework for 5G-empowered industries, to improve industrial FL performance on non-i.i.d. data.", "Taking advantage of naturally clustered factory devices, FedGS uses a gradient-based binary permutation algorithm (GBP-CS) to select a subset of devices within each factory and build homogeneous super nodes participating in FL training.", "Then, we propose a compound-step synchronization protocol to coordinate the training process within and among these super nodes, which shows great robustness against data heterogeneity.", "The proposed methods are time-efficient and can adapt to dynamic environments, without exposing confidential industrial data to risky manipulation.", "We prove that FedGS has better convergence performance than FedAvg and give a relaxed condition under which FedGS is more communication-efficient.", "Extensive experiments show that FedGS improves accuracy by 3.5% and reduces training rounds by 59% on average, confirming its superior effectiveness and efficiency on non-i.i.d. data."]} {"id": "http://arxiv.org/abs/2202.01529", "title": "Comparative assessment of federated and centralized machine learning.", "authors": "Ibrahim Abdul Majeed, Sagar Kaushik, Aniruddha Bardhan, Venkata Siva Kumar Tadi, Hwang-Ki Min, Karthikeyan Kumaraguru, Rajasekhara Duvvuru Muni", "abstract": "Federated Learning (FL) is a privacy-preserving machine learning scheme, where training happens with data federated across devices, without the data leaving them, to preserve user privacy. This is ensured by sending the untrained or partially trained models directly to the individual devices, training them locally \"on-device\" using the device-owned data, and having the server aggregate all the partially trained model learnings to update a global model. Although almost all the model learning schemes in the federated learning setup use gradient descent, there are certain characteristic differences brought about by the non-IID nature of the data availability that affect the training in comparison to the centralized schemes. In this paper, we discuss the various factors that affect the federated learning training, because of the non-IID distributed nature of the data, as well as the inherent differences in the federated learning approach as compared with typical centralized gradient descent techniques. We empirically demonstrate the effect of the number of samples per device and the distribution of output labels on federated learning. 
In addition to the privacy advantage we seek through federated learning, we also study whether there is a cost advantage to using federated learning frameworks. We show that federated learning does have an advantage in cost when the models to be trained are not very large. All in all, we highlight the need for careful model design for both performance and cost.", "sentences": ["Comparative assessment of federated and centralized machine learning.", "Federated Learning (FL) is a privacy-preserving machine learning scheme, where training happens with data federated across devices, without the data leaving them, to preserve user privacy.", "This is ensured by sending the untrained or partially trained models directly to the individual devices, training them locally \"on-device\" using the device-owned data, and having the server aggregate all the partially trained model learnings to update a global model.", "Although almost all the model learning schemes in the federated learning setup use gradient descent, there are certain characteristic differences brought about by the non-IID nature of the data availability that affect the training in comparison to the centralized schemes.", "In this paper, we discuss the various factors that affect the federated learning training, because of the non-IID distributed nature of the data, as well as the inherent differences in the federated learning approach as compared with typical centralized gradient descent techniques.", "We empirically demonstrate the effect of the number of samples per device and the distribution of output labels on federated learning.", "In addition to the privacy advantage we seek through federated learning, we also study whether there is a cost advantage to using federated learning frameworks.", "We show that federated learning does have an advantage in cost when the models to be trained are not very large.", "All in all, we highlight the need for careful model design for both performance and cost."]} {"id": "http://arxiv.org/abs/2202.01602", "title": "The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective.", "authors": "Satyapriya Krishna, Tessa Han, Alex Gu, Javin Pombra, Shahin Jabbari, Steven Wu, Himabindu Lakkaraju", "abstract": "As various post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes settings, it becomes critical to develop a deeper understanding of if and when the explanations output by these methods disagree with each other, and how such disagreements are resolved in practice. However, there is little to no research that provides answers to these critical questions. In this work, we introduce and study the disagreement problem in explainable machine learning. More specifically, we formalize the notion of disagreement between explanations, analyze how often such disagreements occur in practice, and how practitioners resolve these disagreements. To this end, we first conduct interviews with data scientists to understand what constitutes disagreement between explanations generated by different methods for the same model prediction, and introduce a novel quantitative framework to formalize this understanding. We then leverage this framework to carry out a rigorous empirical analysis with four real-world datasets, six state-of-the-art post hoc explanation methods, and eight different predictive models, to measure the extent of disagreement between the explanations generated by various popular explanation methods. 
In addition, we carry out an online user study with data scientists to understand how they resolve the aforementioned disagreements. Our results indicate that state-of-the-art explanation methods often disagree in terms of the explanations they output. Our findings underscore the importance of developing principled evaluation metrics that enable practitioners to effectively compare explanations.", "sentences": ["The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective.", "As various post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes settings, it becomes critical to develop a deeper understanding of if and when the explanations output by these methods disagree with each other, and how such disagreements are resolved in practice.", "However, there is little to no research that provides answers to these critical questions.", "In this work, we introduce and study the disagreement problem in explainable machine learning.", "More specifically, we formalize the notion of disagreement between explanations, analyze how often such disagreements occur in practice, and how practitioners resolve these disagreements.", "To this end, we first conduct interviews with data scientists to understand what constitutes disagreement between explanations generated by different methods for the same model prediction, and introduce a novel quantitative framework to formalize this understanding.", "We then leverage this framework to carry out a rigorous empirical analysis with four real-world datasets, six state-of-the-art post hoc explanation methods, and eight different predictive models, to measure the extent of disagreement between the explanations generated by various popular explanation methods.", "In addition, we carry out an online user study with data scientists to understand how they resolve the aforementioned disagreements.", "Our results indicate that state-of-the-art explanation methods often disagree in terms of the explanations they output.", "Our findings underscore the importance of developing principled evaluation metrics that enable practitioners to effectively compare explanations."]} {"id": "http://arxiv.org/abs/2202.01606", "title": "Graph Coloring with Physics-Inspired Graph Neural Networks.", "authors": "Martin J. A. Schuetz, J. Kyle Brubaker, Zhihuai Zhu, Helmut G. Katzgraber", "abstract": "We show how graph neural networks can be used to solve the canonical graph coloring problem. We frame graph coloring as a multi-class node classification problem and utilize an unsupervised training strategy based on the statistical physics Potts model. Generalizations to other multi-class problems such as community detection, data clustering, and the minimum clique cover problem are straightforward. We provide numerical benchmark results and illustrate our approach with an end-to-end application for a real-world scheduling use case within a comprehensive encode-process-decode framework. 
Our optimization approach performs on par with or outperforms existing solvers, with the ability to scale to problems with millions of variables.", "sentences": ["Graph Coloring with Physics-Inspired Graph Neural Networks.", "We show how graph neural networks can be used to solve the canonical graph coloring problem.", "We frame graph coloring as a multi-class node classification problem and utilize an unsupervised training strategy based on the statistical physics Potts model.", "Generalizations to other multi-class problems such as community detection, data clustering, and the minimum clique cover problem are straightforward.", "We provide numerical benchmark results and illustrate our approach with an end-to-end application for a real-world scheduling use case within a comprehensive encode-process-decode framework.", "Our optimization approach performs on par with or outperforms existing solvers, with the ability to scale to problems with millions of variables."]} {"id": "http://arxiv.org/abs/2202.01645", "title": "AI-as-a-Service Toolkit for Human-Centered Intelligence in Autonomous Driving.", "authors": "Valerio De Caro, Saira Bano, Achilles Machumilane, Alberto Gotta, Pietro Cassar\u00e1, Antonio Carta, Christos Sardianos, Christos Chronis, Iraklis Varlamis, Konstantinos Tserpes, Vincenzo Lomonaco, Claudio Gallicchio, Davide Bacciu", "abstract": "This paper presents a proof-of-concept implementation of the AI-as-a-service toolkit developed within the H2020 TEACHING project and designed to implement an autonomous driving personalization system according to the output of an automatic driver's stress recognition algorithm, both of them realizing a Cyber-Physical System of Systems. In addition, we implemented a data-gathering subsystem to collect data from different sensors, i.e., wearables and cameras, to automatize stress recognition. The system was attached for testing to a driving emulation software, CARLA, which allows testing the approach's feasibility at minimum cost and without putting drivers and passengers at risk.
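The Potts-model training strategy in the graph-coloring record above admits a compact sketch: with per-node soft color distributions, the loss is the expected number of monochromatic edges, which vanishes for a proper coloring. The sketch below replaces the GNN encoder with directly optimized logits for brevity; the loss form follows the stated Potts idea, while all names and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def potts_coloring_loss(logits, edge_index):
    """Unsupervised Potts-style loss for soft graph coloring.

    logits: [num_nodes, num_colors] raw scores (from any GNN in practice).
    edge_index: [2, num_edges] long tensor of (u, v) pairs.
    """
    p = F.softmax(logits, dim=-1)      # soft color assignment per node
    src, dst = edge_index
    return (p[src] * p[dst]).sum()     # sum over edges of <p_u, p_v>

# Toy usage on a triangle graph with 3 colors.
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])
logits = torch.randn(3, 3, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = potts_coloring_loss(logits, edge_index)
    loss.backward()
    opt.step()
print(logits.argmax(dim=-1))  # typically a proper 3-coloring of the triangle
```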
At the core of the respective subsystems, different learning algorithms were implemented using Deep Neural Networks, Recurrent Neural Networks, and Reinforcement Learning.", "sentences": ["AI-as-a-Service Toolkit for Human-Centered Intelligence in Autonomous Driving.", "This paper presents a proof-of-concept implementation of the AI-as-a-service toolkit developed within the H2020 TEACHING project and designed to implement an autonomous driving personalization system according to the output of an automatic driver's stress recognition algorithm, both of them realizing a Cyber-Physical System of Systems.", "In addition, we implemented a data-gathering subsystem to collect data from different sensors, i.e., wearables and cameras, to automatize stress recognition.", "The system was attached for testing to a driving emulation software, CARLA, which allows testing the approach's feasibility at minimum cost and without putting drivers and passengers at risk.", "At the core of the respective subsystems, different learning algorithms were implemented using Deep Neural Networks, Recurrent Neural Networks, and Reinforcement Learning."]} {"id": "http://arxiv.org/abs/2202.01651", "title": "A Survey of Methods for Automated Algorithm Configuration.", "authors": "Elias Schede, Jasmin Brandt, Alexander Tornede, Marcel Wever, Viktor Bengs, Eyke H\u00fcllermeier, Kevin Tierney", "abstract": "Algorithm configuration (AC) is concerned with the automated search for the most suitable parameter configuration of a parametrized algorithm. There is currently a wide variety of AC problem variants and methods proposed in the literature. Existing reviews do not take into account all derivatives of the AC problem, nor do they offer a complete classification scheme. To this end, we introduce taxonomies to describe the AC problem and features of configuration methods, respectively. We review existing AC literature through the lens of our taxonomies, outline relevant design choices of configuration approaches, contrast methods and problem variants against each other, and describe the state of AC in industry.
Finally, our review provides researchers and practitioners with a look at future research directions in the field of AC.", "sentences": ["A Survey of Methods for Automated Algorithm Configuration.", "Algorithm configuration (AC) is concerned with the automated search for the most suitable parameter configuration of a parametrized algorithm.", "There is currently a wide variety of AC problem variants and methods proposed in the literature.", "Existing reviews do not take into account all derivatives of the AC problem, nor do they offer a complete classification scheme.", "To this end, we introduce taxonomies to describe the AC problem and features of configuration methods, respectively.", "We review existing AC literature through the lens of our taxonomies, outline relevant design choices of configuration approaches, contrast methods and problem variants against each other, and describe the state of AC in industry.", "Finally, our review provides researchers and practitioners with a look at future research directions in the field of AC."]} {"id": "http://arxiv.org/abs/2202.01660", "title": "Computational Aspects of Conditional Minisum Approval Voting in Elections with Interdependent Issues.", "authors": "Evangelos Markakis, Georgios Papasotiropoulos", "abstract": "Approval voting provides a simple, practical framework for multi-issue elections, and the most representative example among such election rules is the classic Minisum approval voting rule. We consider a generalization of Minisum, introduced by the work of Barrot and Lang [2016], referred to as Conditional Minisum, where voters are also allowed to express dependencies between issues. The price we have to pay when we move to this higher level of expressiveness is that we end up with a computationally hard rule. Motivated by this, we focus on the computational aspects of Conditional Minisum, where progress has been rather scarce so far. We identify restrictions that concern the voters' dependencies and the value of an optimal solution, under which we provide the first multiplicative approximation algorithms for the problem. At the same time, by additionally requiring certain structural properties for the union of dependencies cast by the whole electorate, we obtain optimal efficient algorithms for well-motivated special cases.
Overall, our work provides a better understanding of the complexity implications introduced by conditional voting.", "sentences": ["Computational Aspects of Conditional Minisum Approval Voting in Elections with Interdependent Issues.", "Approval voting provides a simple, practical framework for multi-issue elections, and the most representative example among such election rules is the classic Minisum approval voting rule.", "We consider a generalization of Minisum, introduced by the work of Barrot and Lang [2016], referred to as Conditional Minisum, where voters are also allowed to express dependencies between issues.", "The price we have to pay when we move to this higher level of expressiveness is that we end up with a computationally hard rule.", "Motivated by this, we focus on the computational aspects of Conditional Minisum, where progress has been rather scarce so far.", "We identify restrictions that concern the voters' dependencies and the value of an optimal solution, under which we provide the first multiplicative approximation algorithms for the problem.", "At the same time, by additionally requiring certain structural properties for the union of dependencies cast by the whole electorate, we obtain optimal efficient algorithms for well-motivated special cases.", "Overall, our work provides a better understanding of the complexity implications introduced by conditional voting."]} {"id": "http://arxiv.org/abs/2202.01661", "title": "Selection in the Presence of Implicit Bias: The Advantage of Intersectional Constraints.", "authors": "Anay Mehrotra, Bary S. R. Pradelski, Nisheeth K. Vishnoi", "abstract": "In selection processes such as hiring, promotion, and college admissions, implicit bias toward socially-salient attributes such as race, gender, or sexual orientation of candidates is known to produce persistent inequality and reduce aggregate utility for the decision maker. Interventions such as the Rooney Rule and its generalizations, which require the decision maker to select at least a specified number of individuals from each affected group, have been proposed to mitigate the adverse effects of implicit bias in selection. Recent works have established that such lower-bound constraints can be very effective in improving aggregate utility in the case when each individual belongs to at most one affected group. However, in several settings, individuals may belong to multiple affected groups and, consequently, face more extreme implicit bias due to this intersectionality. We consider independently drawn utilities and show that, in the intersectional case, the aforementioned non-intersectional constraints can only recover part of the total utility achievable in the absence of implicit bias. On the other hand, we show that if one includes appropriate lower-bound constraints on the intersections, almost all the utility achievable in the absence of implicit bias can be recovered.
Thus, intersectional constraints can offer a significant advantage over a reductionist dimension-by-dimension non-intersectional approach to reducing inequality.", "sentences": ["Selection in the Presence of Implicit Bias: The Advantage of Intersectional Constraints.", "In selection processes such as hiring, promotion, and college admissions, implicit bias toward socially-salient attributes such as race, gender, or sexual orientation of candidates is known to produce persistent inequality and reduce aggregate utility for the decision maker.", "Interventions such as the Rooney Rule and its generalizations, which require the decision maker to select at least a specified number of individuals from each affected group, have been proposed to mitigate the adverse effects of implicit bias in selection.", "Recent works have established that such lower-bound constraints can be very effective in improving aggregate utility in the case when each individual belongs to at most one affected group.", "However, in several settings, individuals may belong to multiple affected groups and, consequently, face more extreme implicit bias due to this intersectionality.", "We consider independently drawn utilities and show that, in the intersectional case, the aforementioned non-intersectional constraints can only recover part of the total utility achievable in the absence of implicit bias.", "On the other hand, we show that if one includes appropriate lower-bound constraints on the intersections, almost all the utility achievable in the absence of implicit bias can be recovered.", "Thus, intersectional constraints can offer a significant advantage over a reductionist dimension-by-dimension non-intersectional approach to reducing inequality."]} {"id": "http://arxiv.org/abs/2202.01666", "title": "Equality Is Not Equity: Proportional Fairness in Federated Learning.", "authors": "Guojun Zhang, Saber Malekmohammadi, Xi Chen, Yaoliang Yu", "abstract": "Ensuring fairness of machine learning (ML) algorithms is becoming an increasingly important mission for ML service providers. This is even more critical and challenging in the federated learning (FL) scenario, given a large number of diverse participating clients. Simply mandating equality across clients could lead to many undesirable consequences, potentially discouraging high-performing clients and resulting in sub-optimal overall performance. In order to achieve better equity rather than equality, in this work, we introduce and study proportional fairness (PF) in FL, which has a deep connection with game theory. By viewing FL from a cooperative game perspective, where the players (clients) collaboratively learn a good model, we formulate PF as Nash bargaining solutions. Based on this concept, we propose PropFair, a novel and easy-to-implement algorithm for effectively finding PF solutions, and we prove its convergence properties. 
We illustrate through experiments that PropFair consistently improves the worst-case and the overall performance simultaneously over state-of-the-art fair FL algorithms for a wide array of vision and language datasets, thus achieving better equity.", "sentences": ["Equality Is Not Equity: Proportional Fairness in Federated Learning.", "Ensuring fairness of machine learning (ML) algorithms is becoming an increasingly important mission for ML service providers.", "This is even more critical and challenging in the federated learning (FL) scenario, given a large number of diverse participating clients.", "Simply mandating equality across clients could lead to many undesirable consequences, potentially discouraging high-performing clients and resulting in sub-optimal overall performance.", "In order to achieve better equity rather than equality, in this work, we introduce and study proportional fairness (PF) in FL, which has a deep connection with game theory.", "By viewing FL from a cooperative game perspective, where the players (clients) collaboratively learn a good model, we formulate PF as Nash bargaining solutions.", "Based on this concept, we propose PropFair, a novel and easy-to-implement algorithm for effectively finding PF solutions, and we prove its convergence properties.", "We illustrate through experiments that PropFair consistently improves the worst-case and the overall performance simultaneously over state-of-the-art fair FL algorithms for a wide array of vision and language datasets, thus achieving better equity."]} {"id": "http://arxiv.org/abs/2202.01677", "title": "Separating Rule Discovery and Global Solution Composition in a Learning Classifier System.", "authors": "Michael Heider, Helena Stegherr, Jonathan Wurth, Roman Sraj, J\u00f6rg H\u00e4hner", "abstract": "The utilization of digital agents to support crucial decision making is increasing in many industrial scenarios. However, trust in suggestions made by these agents is hard to achieve, though essential for profiting from their application, resulting in a need for explanations for both the decision making process as well as the model itself. For many systems, such as common deep learning black-box models, achieving at least some explainability requires complex post-processing, while other systems profit from being, to a reasonable extent, inherently interpretable. In this paper we propose an easily interpretable rule-based learning system specifically designed and thus especially suited for these scenarios and compare it on a set of regression problems against XCSF, a prominent rule-based learning system with a long research history. One key advantage of our system is that the rules' conditions and which rules compose a solution to the problem are evolved separately. We utilise independent rule fitnesses which allow users to specifically tailor their model structure to fit the given requirements for explainability. We find that the results of SupRB2's evaluation are comparable to XCSF's while allowing easier control of model structure and showing a substantially smaller sensitivity to random seeds and data splits.
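The Nash bargaining formulation in the PropFair record above has a standard reading that a few lines can convey: maximize the product of clients' utility gains over a disagreement point, equivalently a sum of logs. The sketch below is a hedged illustration of that objective, not the paper's algorithm; the `baseline` disagreement point and all values are assumptions.

```python
import numpy as np

def propfair_style_objective(client_losses, baseline=5.0):
    """Hedged sketch of a proportional-fairness (Nash bargaining) objective.

    Maximizing sum_i log(baseline - loss_i) trades total performance off
    against equity: a client whose loss approaches `baseline` dominates
    the gradient and gets pulled back first. `baseline` is an assumed
    disagreement point, not a value from the paper.
    """
    gains = np.clip(baseline - np.asarray(client_losses), 1e-8, None)
    return np.sum(np.log(gains))

# Equal total loss, different spreads: the more equitable split scores higher.
print(propfair_style_objective([1.0, 1.0]))   # balanced clients
print(propfair_style_objective([0.2, 1.8]))   # same sum, less equitable
```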
This increased control aids in subsequently providing explanations for both the training and the final structure of the model.", "sentences": ["Separating Rule Discovery and Global Solution Composition in a Learning Classifier System.", "The utilization of digital agents to support crucial decision making is increasing in many industrial scenarios.", "However, trust in suggestions made by these agents is hard to achieve, though essential for profiting from their application, resulting in a need for explanations for both the decision making process as well as the model itself.", "For many systems, such as common deep learning black-box models, achieving at least some explainability requires complex post-processing, while other systems profit from being, to a reasonable extent, inherently interpretable.", "In this paper we propose an easily interpretable rule-based learning system specifically designed and thus especially suited for these scenarios and compare it on a set of regression problems against XCSF, a prominent rule-based learning system with a long research history.", "One key advantage of our system is that the rules' conditions and which rules compose a solution to the problem are evolved separately.", "We utilise independent rule fitnesses which allow users to specifically tailor their model structure to fit the given requirements for explainability.", "We find that the results of SupRB2's evaluation are comparable to XCSF's while allowing easier control of model structure and showing a substantially smaller sensitivity to random seeds and data splits.", "This increased control aids in subsequently providing explanations for both the training and the final structure of the model."]} {"id": "http://arxiv.org/abs/2202.01690", "title": "Machine Learning and Artificial Intelligence in Next-Generation Wireless Network.", "authors": "Wafeeq Iqbal, Wei Wang, Ting Zhu", "abstract": "Due to the advancement in technologies, the next-generation wireless network will be very diverse and complex, shaped by the changing demands of consumers. The current network operator methodologies and approaches are traditional and cannot help the next generation networks to utilize their resources most appropriately. The limited capability of the traditional tools will not allow the network providers to fulfill the demands of the network's subscribers in the future. Therefore, this paper will focus on machine learning, automation, artificial intelligence, and big data analytics for improving the capacity and effectiveness of next-generation wireless networks. The paper will discuss the role of these new technologies in improving the service and performance of the network providers in the future. The paper will show that machine learning, big data analytics, and artificial intelligence will help in making the next-generation wireless network self-adaptive, self-aware, prescriptive, and proactive.
The paper concludes that future wireless network operators cannot operate without shifting their operational framework to AI and machine learning technologies.", "sentences": ["Machine Learning and Artificial Intelligence in Next-Generation Wireless Network.", "Due to the advancement in technologies, the next-generation wireless network will be very diverse and complex, shaped by the changing demands of consumers.", "The current network operator methodologies and approaches are traditional and cannot help the next generation networks to utilize their resources most appropriately.", "The limited capability of the traditional tools will not allow the network providers to fulfill the demands of the network's subscribers in the future.", "Therefore, this paper will focus on machine learning, automation, artificial intelligence, and big data analytics for improving the capacity and effectiveness of next-generation wireless networks.", "The paper will discuss the role of these new technologies in improving the service and performance of the network providers in the future.", "The paper will show that machine learning, big data analytics, and artificial intelligence will help in making the next-generation wireless network self-adaptive, self-aware, prescriptive, and proactive.", "The paper concludes that future wireless network operators cannot operate without shifting their operational framework to AI and machine learning technologies."]} {"id": "http://arxiv.org/abs/2202.01691", "title": "Modeling Bounded Rationality in Multi-Agent Simulations Using Rationally Inattentive Reinforcement Learning.", "authors": "Tong Mu, Stephan Zheng, Alexander Trott", "abstract": "Multi-agent reinforcement learning (MARL) is a powerful framework for studying emergent behavior in complex agent-based simulations. However, RL agents are often assumed to be rational and behave optimally, which does not fully reflect human behavior. Here, we study more human-like RL agents which incorporate an established model of human irrationality, the Rational Inattention (RI) model. RI models the cost of cognitive information processing using mutual information. Our RIRL framework generalizes and is more flexible than prior work by allowing for multi-timestep dynamics and information channels with heterogeneous processing costs. We evaluate RIRL in Principal-Agent (specifically manager-employee relations) problem settings of varying complexity where RI models information asymmetry (e.g. it may be costly for the manager to observe certain information about the employees). We show that using RIRL yields a rich spectrum of new equilibrium behaviors that differ from those found under rational assumptions. For instance, some forms of a Principal's inattention can increase Agent welfare due to increased compensation, while other forms of inattention can decrease Agent welfare by encouraging extra work effort. Additionally, new strategies emerge compared to those under rationality assumptions, e.g., Agents are incentivized to increase work effort.
These results suggest RIRL is a powerful tool towards building AI agents that can mimic real human behavior.", "sentences": ["Modeling Bounded Rationality in Multi-Agent Simulations Using Rationally Inattentive Reinforcement Learning.", "Multi-agent reinforcement learning (MARL) is a powerful framework for studying emergent behavior in complex agent-based simulations.", "However, RL agents are often assumed to be rational and behave optimally, which does not fully reflect human behavior.", "Here, we study more human-like RL agents which incorporate an established model of human irrationality, the Rational Inattention (RI) model.", "RI models the cost of cognitive information processing using mutual information.", "Our RIRL framework generalizes and is more flexible than prior work by allowing for multi-timestep dynamics and information channels with heterogeneous processing costs.", "We evaluate RIRL in Principal-Agent (specifically manager-employee relations) problem settings of varying complexity where RI models information asymmetry (e.g.", "it may be costly for the manager to observe certain information about the employees).", "We show that using RIRL yields a rich spectrum of new equilibrium behaviors that differ from those found under rational assumptions.", "For instance, some forms of a Principal's inattention can increase Agent welfare due to increased compensation, while other forms of inattention can decrease Agent welfare by encouraging extra work effort.", "Additionally, new strategies emerge compared to those under rationality assumptions, e.g., Agents are incentivized to increase work effort.", "These results suggest RIRL is a powerful tool towards building AI agents that can mimic real human behavior."]} {"id": "http://arxiv.org/abs/2202.01696", "title": "QoS-SLA-Aware Artificial Intelligence Adaptive Genetic Algorithm for Multi-Request Offloading in Integrated Edge-Cloud Computing System for the Internet of Vehicles.", "authors": "Leila Ismail, Huned Materwala, Hossam S. Hassanein", "abstract": "Internet of Vehicles (IoV) over Vehicular Ad-hoc Networks (VANETS) is an emerging technology enabling the development of smart city applications for safer, more efficient, and more pleasant travel. These applications have stringent requirements expressed in Service Level Agreements (SLAs). Considering vehicles' limited computational and storage capabilities, application requests are offloaded to an integrated edge-cloud computing system. Existing offloading solutions focus on optimizing applications' Quality of Service (QoS) while respecting a single SLA constraint. They do not consider the impact of overlapped request processing. Very few contemplate the varying speed of a vehicle. This paper proposes a novel Artificial Intelligence (AI) QoS-SLA-aware genetic algorithm (GA) for multi-request offloading in a heterogeneous edge-cloud computing system, considering the impact of overlapping request processing and dynamic vehicle speed. The objective of the optimization algorithm is to improve the applications' Quality of Service (QoS) by minimizing the total execution time. The proposed algorithm integrates an adaptive penalty function to assimilate the SLA constraints in terms of latency, processing time, deadline, CPU, and memory requirements. Numerical experiments and comparative analysis are conducted between our proposed QoS-SLA-aware GA, random, and GA baseline approaches.
The results show that the proposed algorithm executes the requests 1.22 times faster on average compared to the random approach, with 59.9% fewer SLA violations. While the GA baseline approach executes the requests 1.14 times faster than the random approach, it incurs 19.8% more SLA violations than our approach.", "sentences": ["QoS-SLA-Aware Artificial Intelligence Adaptive Genetic Algorithm for Multi-Request Offloading in Integrated Edge-Cloud Computing System for the Internet of Vehicles.", "Internet of Vehicles (IoV) over Vehicular Ad-hoc Networks (VANETS) is an emerging technology enabling the development of smart city applications for safer, more efficient, and more pleasant travel.", "These applications have stringent requirements expressed in Service Level Agreements (SLAs).", "Considering vehicles' limited computational and storage capabilities, application requests are offloaded to an integrated edge-cloud computing system.", "Existing offloading solutions focus on optimizing applications' Quality of Service (QoS) while respecting a single SLA constraint.", "They do not consider the impact of overlapped request processing.", "Very few contemplate the varying speed of a vehicle.", "This paper proposes a novel Artificial Intelligence (AI) QoS-SLA-aware genetic algorithm (GA) for multi-request offloading in a heterogeneous edge-cloud computing system, considering the impact of overlapping request processing and dynamic vehicle speed.", "The objective of the optimization algorithm is to improve the applications' Quality of Service (QoS) by minimizing the total execution time.", "The proposed algorithm integrates an adaptive penalty function to assimilate the SLA constraints in terms of latency, processing time, deadline, CPU, and memory requirements.", "Numerical experiments and comparative analysis are conducted between our proposed QoS-SLA-aware GA, random, and GA baseline approaches.", "The results show that the proposed algorithm executes the requests 1.22 times faster on average compared to the random approach, with 59.9% fewer SLA violations.", "While the GA baseline approach executes the requests 1.14 times faster than the random approach, it incurs 19.8% more SLA violations than our approach."]} {"id": "http://arxiv.org/abs/2202.01741", "title": "How to Leverage Unlabeled Data in Offline Reinforcement Learning.", "authors": "Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Chelsea Finn, Sergey Levine", "abstract": "Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, it requires reward annotations for every transition. In many cases, labeling large datasets with rewards may be costly, especially if those rewards must be provided by human labelers, while collecting diverse unlabeled data might be comparatively inexpensive. How can we best leverage such unlabeled data in offline RL? One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data. In this paper, we find that, perhaps surprisingly, a much simpler method that simply applies zero rewards to unlabeled data leads to effective data sharing both in theory and in practice, without learning any reward model at all. While this approach might seem strange (and incorrect) at first, we provide extensive theoretical and empirical analysis that illustrates how it trades off reward bias, sample complexity and distributional shift, often leading to good results.
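The adaptive-penalty idea in the QoS-SLA record above can be sketched generically: infeasible offloading plans are not discarded but penalized, with the penalty weight growing across generations so the search drifts toward SLA-feasible solutions. This is a toy sketch under stated assumptions, not the paper's algorithm; every field name (`exec_time`, `deadline`), constant, and the single-SLA-term fitness are hypothetical.

```python
import random

def fitness(offloading, requests, penalty_weight):
    """SLA-penalized fitness for a request-offloading chromosome (lower is better).

    offloading: list of node indices, one per request.
    requests: dicts with assumed fields 'exec_time' (per node) and 'deadline';
    real SLA terms (latency, CPU, memory) would add analogous penalty terms.
    """
    total_time, violations = 0.0, 0
    for req, node in zip(requests, offloading):
        t = req["exec_time"][node]
        total_time += t
        if t > req["deadline"]:
            violations += 1
    return total_time + penalty_weight * violations

requests = [{"exec_time": [2.0, 1.2, 3.1], "deadline": 1.5},
            {"exec_time": [0.9, 2.5, 1.1], "deadline": 1.0}]
pop = [[random.randrange(3) for _ in requests] for _ in range(20)]
for gen in range(50):
    penalty = 1.0 + 0.5 * gen                      # adaptive: grows each generation
    pop.sort(key=lambda c: fitness(c, requests, penalty))
    parents = pop[:10]                             # select the fittest half
    pop = parents + [[g if random.random() > 0.2 else random.randrange(3)
                      for g in random.choice(parents)] for _ in range(10)]
print(pop[0], fitness(pop[0], requests, 1.0))      # best chromosome found
```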
We characterize conditions under which this simple strategy is effective, and further show that extending it with a simple reweighting approach can further alleviate the bias introduced by using incorrect reward labels. Our empirical evaluation confirms these findings in simulated robotic locomotion, navigation, and manipulation settings.", "sentences": ["How to Leverage Unlabeled Data in Offline Reinforcement Learning.", "Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, it requires reward annotations for every transition.", "In many cases, labeling large datasets with rewards may be costly, especially if those rewards must be provided by human labelers, while collecting diverse unlabeled data might be comparatively inexpensive.", "How can we best leverage such unlabeled data in offline RL?", "One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data.", "In this paper, we find that, perhaps surprisingly, a much simpler method that simply applies zero rewards to unlabeled data leads to effective data sharing both in theory and in practice, without learning any reward model at all.", "While this approach might seem strange (and incorrect) at first, we provide extensive theoretical and empirical analysis that illustrates how it trades off reward bias, sample complexity and distributional shift, often leading to good results.", "We characterize conditions under which this simple strategy is effective, and further show that extending it with a simple reweighting approach can further alleviate the bias introduced by using incorrect reward labels.", "Our empirical evaluation confirms these findings in simulated robotic locomotion, navigation, and manipulation settings."]} {"id": "http://arxiv.org/abs/2202.01752", "title": "Near-Optimal Learning of Extensive-Form Games with Imperfect Information.", "authors": "Yu Bai, Chi Jin, Song Mei, Tiancheng Yu", "abstract": "This paper resolves the open question of designing near-optimal algorithms for learning imperfect-information extensive-form games from bandit feedback. We present the first line of algorithms that require only $\\widetilde{\\mathcal{O}}((XA+YB)/\\varepsilon^2)$ episodes of play to find an $\\varepsilon$-approximate Nash equilibrium in two-player zero-sum games, where $X,Y$ are the number of information sets and $A,B$ are the number of actions for the two players. This improves upon the best known sample complexity of $\\widetilde{\\mathcal{O}}((X^2A+Y^2B)/\\varepsilon^2)$ by a factor of $\\widetilde{\\mathcal{O}}(\\max\\{X, Y\\})$, and matches the information-theoretic lower bound up to logarithmic factors. We achieve this sample complexity by two new algorithms: Balanced Online Mirror Descent, and Balanced Counterfactual Regret Minimization. Both algorithms rely on novel approaches of integrating \\emph{balanced exploration policies} into their classical counterparts. 
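The zero-reward strategy in the offline RL record above is simple enough to state in a few lines. A minimal sketch, assuming a dict-of-arrays dataset layout (the key names are illustrative, not the authors' code):

```python
import numpy as np

def merge_with_zero_rewards(labeled, unlabeled):
    """Pool labeled and unlabeled transitions into one offline RL dataset.

    labeled: dict with keys 'obs', 'action', 'reward', 'next_obs'.
    unlabeled: same keys except 'reward'. Rather than fitting a reward
    model, unlabeled transitions simply receive reward 0 before merging.
    """
    n = len(unlabeled["obs"])
    merged = {}
    for key in ("obs", "action", "next_obs"):
        merged[key] = np.concatenate([labeled[key], unlabeled[key]])
    merged["reward"] = np.concatenate(
        [labeled["reward"], np.zeros(n, dtype=labeled["reward"].dtype)])
    return merged

labeled = {"obs": np.zeros((5, 3)), "action": np.zeros(5),
           "reward": np.ones(5), "next_obs": np.zeros((5, 3))}
unlabeled = {"obs": np.zeros((7, 3)), "action": np.zeros(7),
             "next_obs": np.zeros((7, 3))}
data = merge_with_zero_rewards(labeled, unlabeled)
print(data["reward"])  # five 1.0s followed by seven 0.0s
```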
We also extend our results to learning Coarse Correlated Equilibria in multi-player general-sum games.", "sentences": ["Near-Optimal Learning of Extensive-Form Games with Imperfect Information.", "This paper resolves the open question of designing near-optimal algorithms for learning imperfect-information extensive-form games from bandit feedback.", "We present the first line of algorithms that require only $\\widetilde{\\mathcal{O}}((XA+YB)/\\varepsilon^2)$ episodes of play to find an $\\varepsilon$-approximate Nash equilibrium in two-player zero-sum games, where $X,Y$ are the number of information sets and $A,B$ are the number of actions for the two players.", "This improves upon the best known sample complexity of $\\widetilde{\\mathcal{O}}((X^2A+Y^2B)/\\varepsilon^2)$ by a factor of $\\widetilde{\\mathcal{O}}(\\max\\{X, Y\\})$, and matches the information-theoretic lower bound up to logarithmic factors.", "We achieve this sample complexity by two new algorithms: Balanced Online Mirror Descent, and Balanced Counterfactual Regret Minimization.", "Both algorithms rely on novel approaches of integrating \\emph{balanced exploration policies} into their classical counterparts.", "We also extend our results to learning Coarse Correlated Equilibria in multi-player general-sum games."]} {"id": "http://arxiv.org/abs/2202.01764", "title": "JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension.", "authors": "ByungHoon So, Kyuhong Byun, Kyungwon Kang, Seongjin Cho", "abstract": "Question Answering (QA) is a task in which a machine understands a given document and a question to find an answer. Despite impressive progress in the NLP area, QA is still a challenging problem, especially for non-English languages due to the lack of annotated datasets. In this paper, we present the Japanese Question Answering Dataset, JaQuAD, which is annotated by humans. JaQuAD consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles. We fine-tuned a baseline model, which achieves an F1 score of 78.92% and an EM of 63.38% on the test set. The dataset and our experiments are available at https://github.com/SkelterLabsInc/JaQuAD.", "sentences": ["JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension.", "Question Answering (QA) is a task in which a machine understands a given document and a question to find an answer.", "Despite impressive progress in the NLP area, QA is still a challenging problem, especially for non-English languages due to the lack of annotated datasets.", "In this paper, we present the Japanese Question Answering Dataset, JaQuAD, which is annotated by humans.", "JaQuAD consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles.", "We fine-tuned a baseline model, which achieves an F1 score of 78.92% and an EM of 63.38% on the test set.", "The dataset and our experiments are available at https://github.com/SkelterLabsInc/JaQuAD."]} {"id": "http://arxiv.org/abs/2102.02705", "title": "EFloat: Entropy-coded Floating Point Format for Compressing Vector Embedding Models.", "authors": "Rajesh Bordawekar, Bulent Abali, Ming-Hung Chen", "abstract": "In a large class of deep learning models, including vector embedding models such as word and database embeddings, we observe that floating point exponent values cluster around a few unique values, permitting entropy-based data compression. Entropy coding compresses fixed-length values with variable-length codes, encoding most probable values with fewer bits.
We propose the EFloat compressed floating point number format that uses a variable field boundary between the exponent and significand fields. EFloat uses entropy coding on exponent values and signs to minimize the average width of the exponent and sign fields, while preserving the original FP32 exponent range unchanged. Saved bits become part of the significand field, increasing the EFloat numeric precision by 4.3 bits on average compared to other reduced-precision floating point formats. EFloat makes 8-bit and even smaller floats practical without sacrificing the exponent range of a 32-bit floating point representation. We currently use the EFloat format to reduce the memory capacity and bandwidth consumption of large vector embedding models such as those used for database embeddings. Using the RMS error as a metric, we demonstrate that EFloat provides higher accuracy than other floating point formats with an equal bit budget. The EF12 format with a 12-bit budget has less end-to-end application error than the 16-bit BFloat16. EF16 with a 16-bit budget has an RMS error 17 to 35 times lower than that of BF16 for a diverse set of embedding models. When making similarity and dissimilarity queries, using the NDCG ranking metric, EFloat matches the result quality of prior floating point representations with larger bit budgets.", "sentences": ["EFloat: Entropy-coded Floating Point Format for Compressing Vector Embedding Models.", "In a large class of deep learning models, including vector embedding models such as word and database embeddings, we observe that floating point exponent values cluster around a few unique values, permitting entropy-based data compression.", "Entropy coding compresses fixed-length values with variable-length codes, encoding most probable values with fewer bits.", "We propose the EFloat compressed floating point number format that uses a variable field boundary between the exponent and significand fields.", "EFloat uses entropy coding on exponent values and signs to minimize the average width of the exponent and sign fields, while preserving the original FP32 exponent range unchanged.", "Saved bits become part of the significand field, increasing the EFloat numeric precision by 4.3 bits on average compared to other reduced-precision floating point formats.", "EFloat makes 8-bit and even smaller floats practical without sacrificing the exponent range of a 32-bit floating point representation.", "We currently use the EFloat format to reduce the memory capacity and bandwidth consumption of large vector embedding models such as those used for database embeddings.", "Using the RMS error as a metric, we demonstrate that EFloat provides higher accuracy than other floating point formats with an equal bit budget.", "The EF12 format with a 12-bit budget has less end-to-end application error than the 16-bit BFloat16.", "EF16 with a 16-bit budget has an RMS error 17 to 35 times lower than that of BF16 for a diverse set of embedding models.", "When making similarity and dissimilarity queries, using the NDCG ranking metric, EFloat matches the result quality of prior floating point representations with larger bit budgets."]} {"id": "http://arxiv.org/abs/2102.07389", "title": "And/or trade-off in artificial neurons: impact on adversarial robustness.", "authors": "Alessandro Fontana", "abstract": "Since its discovery in 2013, the phenomenon of adversarial examples has attracted a growing amount of attention from the machine learning community.
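The EFloat record's premise, that FP32 exponents of embedding weights cluster around a few values, is easy to check directly. A sketch measuring the entropy of the IEEE-754 exponent field follows; the data here is synthetic and EFloat's actual encoder is not reproduced.

```python
import numpy as np

def fp32_exponent_entropy(x):
    """Entropy (bits) of the FP32 exponent field of an array.

    When exponents cluster, this entropy is far below the 8 bits the
    IEEE-754 exponent field occupies, which is the headroom an
    entropy-coded format like EFloat exploits.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    exponents = (bits >> 23) & 0xFF                # 8-bit exponent field
    counts = np.bincount(exponents, minlength=256)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum()

# Embedding-like values in a narrow range need very few exponent bits.
emb = np.random.normal(0.0, 0.05, size=100_000).astype(np.float32)
print(f"{fp32_exponent_entropy(emb):.2f} bits vs. the 8-bit exponent field")
```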
A deeper understanding of the problem could lead to a better comprehension of how information is processed and encoded in neural networks and, more generally, could help to solve the issue of interpretability in machine learning. Our idea to increase adversarial resilience starts with the observation that artificial neurons can be divided into two broad categories: AND-like neurons and OR-like neurons. Intuitively, the former are characterised by a relatively low number of combinations of input values which trigger neuron activation, while for the latter the opposite is true. Our hypothesis is that the presence in a network of a sufficiently high number of OR-like neurons could lead to classification \"brittleness\" and increase the network's susceptibility to adversarial attacks. After constructing an operational definition of a neuron's AND-like behaviour, we proceed to introduce several measures to increase the proportion of AND-like neurons in the network: L1 norm weight normalisation; application of an input filter; comparison between the neuron output's distribution obtained when the network is fed with the actual data set and the distribution obtained when the network is fed with a randomised version of the former called \"scrambled data set\". Tests performed on the MNIST data set hint that the proposed measures could represent an interesting direction to explore.", "sentences": ["And/or trade-off in artificial neurons: impact on adversarial robustness.", "Since its discovery in 2013, the phenomenon of adversarial examples has attracted a growing amount of attention from the machine learning community.", "A deeper understanding of the problem could lead to a better comprehension of how information is processed and encoded in neural networks and, more generally, could help to solve the issue of interpretability in machine learning.", "Our idea to increase adversarial resilience starts with the observation that artificial neurons can be divided into two broad categories: AND-like neurons and OR-like neurons.", "Intuitively, the former are characterised by a relatively low number of combinations of input values which trigger neuron activation, while for the latter the opposite is true.", "Our hypothesis is that the presence in a network of a sufficiently high number of OR-like neurons could lead to classification \"brittleness\" and increase the network's susceptibility to adversarial attacks.", "After constructing an operational definition of a neuron's AND-like behaviour, we proceed to introduce several measures to increase the proportion of AND-like neurons in the network: L1 norm weight normalisation; application of an input filter; comparison between the neuron output's distribution obtained when the network is fed with the actual data set and the distribution obtained when the network is fed with a randomised version of the former called \"scrambled data set\".", "Tests performed on the MNIST data set hint that the proposed measures could represent an interesting direction to explore."]} {"id": "http://arxiv.org/abs/2103.16440", "title": "Neural Transformation Learning for Deep Anomaly Detection Beyond Images.", "authors": "Chen Qiu, Timo Pfrommer, Marius Kloft, Stephan Mandt, Maja Rudolph", "abstract": "Data transformations (e.g. rotations, reflections, and cropping) play an important role in self-supervised learning.
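The AND-like/OR-like distinction in the record above lends itself to a simple operational proxy: estimate how often a neuron fires under random inputs. This is an illustrative assumption, not the paper's exact operational definition; the binary-input distribution and all values are hypothetical.

```python
import numpy as np

def activation_fraction(w, b, n_samples=10_000, seed=0):
    """Estimate the fraction of random binary input patterns that activate
    a ReLU neuron with weights w and bias b. A small fraction means few
    input combinations trigger the neuron (AND-like); a large fraction
    means many do (OR-like)."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(n_samples, len(w)))
    return float(np.mean(x @ w + b > 0))

w = np.ones(4)
print(activation_fraction(w, b=-3.5))  # fires only if all 4 inputs are 1: AND-like (~0.06)
print(activation_fraction(w, b=-0.5))  # fires if any input is 1: OR-like (~0.94)
```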
Typically, images are transformed into different views, and neural networks trained on tasks involving these views produce useful feature representations for downstream tasks, including anomaly detection. However, for anomaly detection beyond image data, it is often unclear which transformations to use. Here we present a simple end-to-end procedure for anomaly detection with learnable transformations. The key idea is to embed the transformed data into a semantic space such that the transformed data still resemble their untransformed form, while different transformations are easily distinguishable. Extensive experiments on time series demonstrate that our proposed method outperforms existing approaches in the one-vs.-rest setting and is competitive in the more challenging n-vs.-rest anomaly detection task. On tabular datasets from the medical and cyber-security domains, our method learns domain-specific transformations and detects anomalies more accurately than previous work.", "sentences": ["Neural Transformation Learning for Deep Anomaly Detection Beyond Images.", "Data transformations (e.g.", "rotations, reflections, and cropping) play an important role in self-supervised learning.", "Typically, images are transformed into different views, and neural networks trained on tasks involving these views produce useful feature representations for downstream tasks, including anomaly detection.", "However, for anomaly detection beyond image data, it is often unclear which transformations to use.", "Here we present a simple end-to-end procedure for anomaly detection with learnable transformations.", "The key idea is to embed the transformed data into a semantic space such that the transformed data still resemble their untransformed form, while different transformations are easily distinguishable.", "Extensive experiments on time series demonstrate that our proposed method outperforms existing approaches in the one-vs.-rest setting and is competitive in the more challenging n-vs.-rest anomaly detection task.", "On tabular datasets from the medical and cyber-security domains, our method learns domain-specific transformations and detects anomalies more accurately than previous work."]} {"id": "http://arxiv.org/abs/2105.02103", "title": "Prototype Memory for Large-scale Face Representation Learning.", "authors": "Evgeny Smirnov, Nikita Garaev, Vasiliy Galyuk, Evgeny Lukyanets", "abstract": "Face representation learning using datasets with a massive number of identities requires appropriate training methods. The softmax-based approach, currently the state-of-the-art in face recognition, in its usual \"full softmax\" form is not suitable for datasets with millions of persons. Several methods, based on the \"sampled softmax\" approach, were proposed to remove this limitation. These methods, however, have a set of disadvantages. One of them is a problem of \"prototype obsolescence\": classifier weights (prototypes) of the rarely sampled classes receive too scarce gradients and become outdated and detached from the current encoder state, resulting in incorrect training signals. This problem is especially serious in ultra-large-scale datasets. In this paper, we propose a novel face representation learning model called Prototype Memory, which alleviates this problem and allows training on a dataset of any size. Prototype Memory consists of a limited-size memory module for storing recent class prototypes and employs a set of algorithms to update it in an appropriate way.
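The key idea of the transformation-learning record above, views that stay close to their untransformed embedding while remaining mutually distinguishable, has a natural reading as a deterministic contrastive loss. The sketch below is one plausible formulation under that reading; the exact objective in the paper may differ, and all shapes and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def transformation_contrastive_loss(z_orig, z_views, temperature=0.1):
    """Each transformed view is pulled toward its untransformed embedding
    (positive) and pushed away from the other views of the same sample
    (negatives), keeping the learned transformations distinguishable.

    z_orig:  [B, D] embeddings of untransformed samples.
    z_views: [B, K, D] embeddings of K learned transformations per sample.
    """
    z_orig = F.normalize(z_orig, dim=-1)
    z_views = F.normalize(z_views, dim=-1)
    b, k, _ = z_views.shape
    # Positive: similarity of each view to its original sample.
    pos = torch.einsum("bkd,bd->bk", z_views, z_orig) / temperature
    # Negatives: similarities between different views of the same sample.
    vv = torch.einsum("bkd,bjd->bkj", z_views, z_views) / temperature
    mask = ~torch.eye(k, dtype=torch.bool)          # drop self-similarity
    neg = vv[:, mask].view(b, k, k - 1)
    logits = torch.cat([pos.unsqueeze(-1), neg], dim=-1)
    target = torch.zeros(b * k, dtype=torch.long)   # class 0 = the positive
    return F.cross_entropy(logits.reshape(-1, k), target)

loss = transformation_contrastive_loss(torch.randn(4, 32), torch.randn(4, 6, 32))
print(float(loss))
```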
New class prototypes are generated on the fly using exemplar embeddings in the current mini-batch. These prototypes are enqueued to the memory and used in the role of classifier weights for softmax classification-based training. To prevent obsolescence and keep the memory in close connection with the encoder, prototypes are regularly refreshed, and the oldest ones are dequeued and disposed of. Prototype Memory is computationally efficient and independent of dataset size. It can be used with various loss functions, hard example mining algorithms and encoder architectures. We prove the effectiveness of the proposed model by extensive experiments on popular face recognition benchmarks.", "sentences": ["Prototype Memory for Large-scale Face Representation Learning.", "Face representation learning using datasets with a massive number of identities requires appropriate training methods.", "The softmax-based approach, currently the state-of-the-art in face recognition, in its usual \"full softmax\" form is not suitable for datasets with millions of persons.", "Several methods, based on the \"sampled softmax\" approach, were proposed to remove this limitation.", "These methods, however, have a set of disadvantages.", "One of them is a problem of \"prototype obsolescence\": classifier weights (prototypes) of the rarely sampled classes receive too scarce gradients and become outdated and detached from the current encoder state, resulting in incorrect training signals.", "This problem is especially serious in ultra-large-scale datasets.", "In this paper, we propose a novel face representation learning model called Prototype Memory, which alleviates this problem and allows training on a dataset of any size.", "Prototype Memory consists of a limited-size memory module for storing recent class prototypes and employs a set of algorithms to update it in an appropriate way.", "New class prototypes are generated on the fly using exemplar embeddings in the current mini-batch.", "These prototypes are enqueued to the memory and used in the role of classifier weights for softmax classification-based training.", "To prevent obsolescence and keep the memory in close connection with the encoder, prototypes are regularly refreshed, and the oldest ones are dequeued and disposed of.", "Prototype Memory is computationally efficient and independent of dataset size.", "It can be used with various loss functions, hard example mining algorithms and encoder architectures.", "We prove the effectiveness of the proposed model by extensive experiments on popular face recognition benchmarks."]} {"id": "http://arxiv.org/abs/2105.07066", "title": "Node Selection Toward Faster Convergence for Federated Learning on Non-IID Data.", "authors": "Hongda Wu, Ping Wang", "abstract": "Federated Learning (FL) is a distributed learning paradigm that enables a large number of resource-limited nodes to collaboratively train a model without data sharing. The non-independent-and-identically-distributed (non-i.i.d.) data samples induce discrepancies between the global and local objectives, making the FL model slow to converge. In this paper, we propose the Optimal Aggregation algorithm for better aggregation, which finds the optimal subset of local updates of participating nodes in each global round, by identifying and excluding the adverse local updates via checking the relationship between the local gradient and the global gradient.
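The enqueue/refresh/dequeue lifecycle in the Prototype Memory record above maps naturally onto a fixed-capacity FIFO keyed by class. A minimal sketch under stated assumptions (the class names, normalization, and capacity are illustrative, not the authors' implementation):

```python
from collections import OrderedDict
import numpy as np

class PrototypeMemory:
    """Limited-size memory of class prototypes built from in-batch exemplars.

    Updating a class re-enqueues its prototype at the back (a refresh);
    when capacity is exceeded, the oldest prototypes are dequeued, keeping
    the memory in sync with the current encoder.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.prototypes = OrderedDict()             # class_id -> prototype

    def update(self, class_id, exemplar_embeddings):
        proto = exemplar_embeddings.mean(axis=0)    # prototype from exemplars
        proto /= np.linalg.norm(proto) + 1e-12
        self.prototypes.pop(class_id, None)         # refresh: move to the back
        self.prototypes[class_id] = proto
        while len(self.prototypes) > self.capacity:
            self.prototypes.popitem(last=False)     # dequeue the oldest

    def classifier_weights(self):
        """Current prototypes stacked for use as softmax classifier weights."""
        return np.stack(list(self.prototypes.values()))

mem = PrototypeMemory(capacity=2)
for cid in [0, 1, 2]:                               # class 0 gets evicted
    mem.update(cid, np.random.randn(8, 4))
print(list(mem.prototypes), mem.classifier_weights().shape)  # [1, 2] (2, 4)
```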
Then, we propose a Probabilistic Node Selection framework (FedPNS) to dynamically change the probability for each node to be selected based on the output of Optimal Aggregation. FedPNS can preferentially select nodes that propel faster model convergence. The unbiasedness of the proposed FedPNS design is illustrated and the convergence rate improvement of FedPNS over the commonly adopted Federated Averaging (FedAvg) algorithm is analyzed theoretically. Experimental results demonstrate the effectiveness of FedPNS in accelerating the FL convergence rate, as compared to FedAvg with random node selection.", "sentences": ["Node Selection Toward Faster Convergence for Federated Learning on Non-IID Data.", "Federated Learning (FL) is a distributed learning paradigm that enables a large number of resource-limited nodes to collaboratively train a model without data sharing.", "The non-independent-and-identically-distributed (non-i.i.d.)", "data samples induce discrepancies between the global and local objectives, making the FL model slow to converge.", "In this paper, we propose the Optimal Aggregation algorithm for better aggregation, which finds the optimal subset of local updates of participating nodes in each global round, by identifying and excluding the adverse local updates via checking the relationship between the local gradient and the global gradient.", "Then, we propose a Probabilistic Node Selection framework (FedPNS) to dynamically change the probability for each node to be selected based on the output of Optimal Aggregation.", "FedPNS can preferentially select nodes that propel faster model convergence.", "The unbiasedness of the proposed FedPNS design is illustrated and the convergence rate improvement of FedPNS over the commonly adopted Federated Averaging (FedAvg) algorithm is analyzed theoretically.", "Experimental results demonstrate the effectiveness of FedPNS in accelerating the FL convergence rate, as compared to FedAvg with random node selection."]} {"id": "http://arxiv.org/abs/2106.10466", "title": "TS2Vec: Towards Universal Representation of Time Series.", "authors": "Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, Bixiong Xu", "abstract": "This paper presents TS2Vec, a universal framework for learning representations of time series at an arbitrary semantic level. Unlike existing methods, TS2Vec performs contrastive learning in a hierarchical way over augmented context views, which enables a robust contextual representation for each timestamp. Furthermore, to obtain the representation of an arbitrary sub-sequence in the time series, we can apply a simple aggregation over the representations of corresponding timestamps. We conduct extensive experiments on time series classification tasks to evaluate the quality of time series representations. As a result, TS2Vec achieves significant improvement over existing SOTAs of unsupervised time series representation on 125 UCR datasets and 29 UEA datasets. The learned timestamp-level representations also achieve superior results in time series forecasting and anomaly detection tasks. A linear regression trained on top of the learned representations outperforms previous SOTAs of time series forecasting. Furthermore, we present a simple way to apply the learned representations for unsupervised anomaly detection, which establishes SOTA results in the literature.
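The "checking the relationship between the local gradient and the global gradient" step in the FedPNS record above can be sketched as an alignment test. The exact test and thresholds here are assumptions for illustration, not the paper's criterion: a local update is kept only if it points in roughly the same direction as the aggregate of the remaining nodes.

```python
import numpy as np

def exclude_adverse_updates(local_grads, eps=0.0):
    """Keep indices of local gradients positively aligned with the
    aggregate gradient of the other nodes; drop the rest as adverse."""
    kept = []
    for i, g in enumerate(local_grads):
        others = np.mean([h for j, h in enumerate(local_grads) if j != i], axis=0)
        if float(np.dot(g, others)) > eps:          # aligned with the rest
            kept.append(i)
    return kept

grads = [np.array([1.0, 0.9]), np.array([0.8, 1.1]), np.array([-0.3, -0.2])]
print(exclude_adverse_updates(grads))               # [0, 1]: node 2 is excluded
```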
The source code is publicly available at https://github.com/yuezhihan/ts2vec.", "sentences": ["TS2Vec: Towards Universal Representation of Time Series.", "This paper presents TS2Vec, a universal framework for learning representations of time series at an arbitrary semantic level.", "Unlike existing methods, TS2Vec performs contrastive learning in a hierarchical way over augmented context views, which enables a robust contextual representation for each timestamp.", "Furthermore, to obtain the representation of an arbitrary sub-sequence in the time series, we can apply a simple aggregation over the representations of corresponding timestamps.", "We conduct extensive experiments on time series classification tasks to evaluate the quality of time series representations.", "As a result, TS2Vec achieves significant improvement over existing SOTAs of unsupervised time series representation on 125 UCR datasets and 29 UEA datasets.", "The learned timestamp-level representations also achieve superior results in time series forecasting and anomaly detection tasks.", "A linear regression trained on top of the learned representations outperforms previous SOTAs of time series forecasting.", "Furthermore, we present a simple way to apply the learned representations for unsupervised anomaly detection, which establishes SOTA results in the literature.", "The source code is publicly available at https://github.com/yuezhihan/ts2vec."]} {"id": "http://arxiv.org/abs/2106.16004", "title": "What can linear interpolation of neural network loss landscapes tell us?.", "authors": "Tiffany Vlaar, Jonathan Frankle", "abstract": "Studying neural network loss landscapes provides insights into the nature of the underlying optimization problems. Unfortunately, loss landscapes are notoriously difficult to visualize in a human-comprehensible fashion. One common way to address this problem is to plot linear slices of the landscape, for example from the initial state of the network to the final state after optimization. On the basis of this analysis, prior work has drawn broader conclusions about the difficulty of the optimization problem. In this paper, we put inferences of this kind to the test, systematically evaluating how linear interpolation and final performance vary when altering the data, choice of initialization, and other optimizer and architecture design choices. Further, we use linear interpolation to study the role played by individual layers and substructures of the network. We find that certain layers are more sensitive to the choice of initialization, but that the shape of the linear path is not indicative of the changes in test accuracy of the model.
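The "simple aggregation over the representations of corresponding timestamps" in the TS2Vec record above amounts to pooling a slice of the per-timestamp embedding matrix. A minimal sketch using max pooling (which the TS2Vec repository uses for instance-level representations; the shapes here are illustrative):

```python
import numpy as np

def subsequence_representation(timestamp_reprs, start, end):
    """Representation of an arbitrary sub-sequence, obtained by pooling
    the learned per-timestamp representations it spans.

    timestamp_reprs: [T, D] array of timestamp-level representations.
    """
    return timestamp_reprs[start:end].max(axis=0)

reprs = np.random.randn(100, 320)                   # e.g. T=100 timestamps, D=320
window = subsequence_representation(reprs, 20, 50)  # any sub-sequence
full = subsequence_representation(reprs, 0, 100)    # instance-level representation
print(window.shape, full.shape)                     # (320,) (320,)
```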
Our results cast doubt on the broader intuition that the presence or absence of barriers when interpolating necessarily relates to the success of optimization.", "sentences": ["What can linear interpolation of neural network loss landscapes tell us?.", "Studying neural network loss landscapes provides insights into the nature of the underlying optimization problems.", "Unfortunately, loss landscapes are notoriously difficult to visualize in a human-comprehensible fashion.", "One common way to address this problem is to plot linear slices of the landscape, for example from the initial state of the network to the final state after optimization.", "On the basis of this analysis, prior work has drawn broader conclusions about the difficulty of the optimization problem.", "In this paper, we put inferences of this kind to the test, systematically evaluating how linear interpolation and final performance vary when altering the data, choice of initialization, and other optimizer and architecture design choices.", "Further, we use linear interpolation to study the role played by individual layers and substructures of the network.", "We find that certain layers are more sensitive to the choice of initialization, but that the shape of the linear path is not indicative of the changes in test accuracy of the model.", "Our results cast doubt on the broader intuition that the presence or absence of barriers when interpolating necessarily relates to the success of optimization."]} {"id": "http://arxiv.org/abs/2108.03033", "title": "Nonground Abductive Logic Programming with Probabilistic Integrity Constraints.", "authors": "Elena Bellodi, Marco Gavanelli, Riccardo Zese, Evelina Lamma, Fabrizio Riguzzi", "abstract": "Uncertain information is being taken into account in an increasing number of application fields. In the meantime, abduction has proven to be a powerful tool for handling hypothetical reasoning and incomplete knowledge. Probabilistic logical models are a suitable framework to handle uncertain information, and in the last decade many probabilistic logical languages have been proposed, as well as inference and learning systems for them. In the realm of Abductive Logic Programming (ALP), a variety of proof procedures have been defined as well. In this paper, we consider a richer logic language, coping with probabilistic abduction with variables. In particular, we consider an ALP program enriched with integrity constraints à la IFF, possibly annotated with a probability value. We first present the overall abductive language, and its semantics according to the Distribution Semantics.
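The linear-slice analysis in the loss-landscape record above is a one-liner worth making concrete: evaluate the loss along theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final and look for a bump (a "barrier") between the endpoints. A minimal sketch on a toy landscape; the real studies interpolate trained network weights.

```python
import numpy as np

def linear_interpolation_losses(theta_init, theta_final, loss_fn, n_points=25):
    """Losses along the straight line between two parameter vectors."""
    alphas = np.linspace(0.0, 1.0, n_points)
    losses = [loss_fn((1 - a) * theta_init + a * theta_final) for a in alphas]
    return alphas, losses

# Toy quadratic 'landscape' with a barrier between two minima at +/-1.
loss_fn = lambda th: float(np.sum((th**2 - 1.0) ** 2))
alphas, losses = linear_interpolation_losses(
    np.full(10, -1.0), np.full(10, 1.0), loss_fn)
barrier = max(losses) - max(losses[0], losses[-1])
print(f"barrier height along the path: {barrier:.2f}")  # 10.00 here
```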
We then introduce a proof procedure, obtained by extending one previously presented, and prove its soundness and completeness.", "sentences": ["Nonground Abductive Logic Programming with Probabilistic Integrity Constraints.", "Uncertain information is being taken into account in an increasing number of application fields.", "In the meantime, abduction has been proved a powerful tool for handling hypothetical reasoning and incomplete knowledge.", "Probabilistic logical models are a suitable framework to handle uncertain information, and in the last decade many probabilistic logical languages have been proposed, as well as inference and learning systems for them.", "In the realm of Abductive Logic Programming (ALP), a variety of proof procedures have been defined as well.", "In this paper, we consider a richer logic language, coping with probabilistic abduction with variables.", "In particular, we consider an ALP program enriched with integrity constraints `a la IFF, possibly annotated with a probability value.", "We first present the overall abductive language, and its semantics according to the Distribution Semantics.", "We then introduce a proof procedure, obtained by extending one previously presented, and prove its soundness and completeness."]} {"id": "http://arxiv.org/abs/2109.06715", "title": "IGNNITION: Bridging the Gap Between Graph Neural Networks and Networking Systems.", "authors": "David Pujol-Perich, Jos\u00e9 Su\u00e1rez-Varela, Miquel Ferriol, Shihan Xiao, Bo Wu, Albert Cabellos-Aparicio, Pere Barlet-Ros", "abstract": "Recent years have seen the vast potential of Graph Neural Networks (GNN) in many fields where data is structured as graphs (e.g., chemistry, recommender systems). In particular, GNNs are becoming increasingly popular in the field of networking, as graphs are intrinsically present at many levels (e.g., topology, routing). The main novelty of GNNs is their ability to generalize to other networks unseen during training, which is an essential feature for developing practical Machine Learning (ML) solutions for networking. However, implementing a functional GNN prototype is currently a cumbersome task that requires strong skills in neural network programming. This poses an important barrier to network engineers that often do not have the necessary ML expertise. In this article, we present IGNNITION, a novel open-source framework that enables fast prototyping of GNNs for networking systems. IGNNITION is based on an intuitive high-level abstraction that hides the complexity behind GNNs, while still offering great flexibility to build custom GNN architectures. To showcase the versatility and performance of this framework, we implement two state-of-the-art GNN models applied to different networking use cases. 
Our results show that the GNN models produced by IGNNITION are equivalent in terms of accuracy and performance to their native implementations in TensorFlow.", "sentences": ["IGNNITION: Bridging the Gap Between Graph Neural Networks and Networking Systems.", "Recent years have seen the vast potential of Graph Neural Networks (GNN) in many fields where data is structured as graphs (e.g., chemistry, recommender systems).", "In particular, GNNs are becoming increasingly popular in the field of networking, as graphs are intrinsically present at many levels (e.g., topology, routing).", "The main novelty of GNNs is their ability to generalize to other networks unseen during training, which is an essential feature for developing practical Machine Learning (ML) solutions for networking.", "However, implementing a functional GNN prototype is currently a cumbersome task that requires strong skills in neural network programming.", "This poses an important barrier to network engineers that often do not have the necessary ML expertise.", "In this article, we present IGNNITION, a novel open-source framework that enables fast prototyping of GNNs for networking systems.", "IGNNITION is based on an intuitive high-level abstraction that hides the complexity behind GNNs, while still offering great flexibility to build custom GNN architectures.", "To showcase the versatility and performance of this framework, we implement two state-of-the-art GNN models applied to different networking use cases.", "Our results show that the GNN models produced by IGNNITION are equivalent in terms of accuracy and performance to their native implementations in TensorFlow."]} {"id": "http://arxiv.org/abs/2110.00269", "title": "A Survey of Knowledge Enhanced Pre-trained Models.", "authors": "Jian Yang, Gang Xiao, Yulong Shen, Wei Jiang, Xinyu Hu, Ying Zhang, Jinghui Peng", "abstract": "Pre-trained models learn contextualized word representations on large-scale text corpus through a self-supervised learning method, which has achieved promising performance after fine-tuning. These models, however, suffer from poor robustness and lack of interpretability. Pre-trained models with knowledge injection, which we call knowledge enhanced pre-trained models (KEPTMs), possess deep understanding and logical reasoning and introduce interpretability to some extent. In this survey, we provide a comprehensive overview of KEPTMs for natural language processing. We first introduce the progress of pre-trained models and knowledge representation learning. Then we systematically categorize existing KEPTMs from three different perspectives. 
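For the IGNNITION entry: frameworks of this kind ultimately generate standard message-passing computations. Below is a generic single message-passing step in plain numpy; it is not IGNNITION's actual interface (which is driven by a high-level model description), and all weights and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n_nodes, d = 4, 8
h = rng.normal(size=(n_nodes, d))          # node hidden states
edges = [(0, 1), (1, 2), (2, 0), (3, 2)]   # directed edges (src, dst)
W_msg = rng.normal(size=(d, d)) * 0.1      # message function weights
W_upd = rng.normal(size=(2 * d, d)) * 0.1  # update function weights

# 1) message: each edge transforms the sender's state.
# 2) aggregate: sum incoming messages per destination node.
agg = np.zeros_like(h)
for src, dst in edges:
    agg[dst] += np.tanh(h[src] @ W_msg)

# 3) update: combine each node's state with its aggregated messages.
h_next = np.tanh(np.concatenate([h, agg], axis=1) @ W_upd)
print(h_next.shape)  # (4, 8)
```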
Finally, we outline some potential directions of KEPTMs for future research.", "sentences": ["A Survey of Knowledge Enhanced Pre-trained Models.", "Pre-trained models learn contextualized word representations on large-scale text corpus through a self-supervised learning method, which has achieved promising performance after fine-tuning.", "These models, however, suffer from poor robustness and lack of interpretability.", "Pre-trained models with knowledge injection, which we call knowledge enhanced pre-trained models (KEPTMs), possess deep understanding and logical reasoning and introduce interpretability to some extent.", "In this survey, we provide a comprehensive overview of KEPTMs for natural language processing.", "We first introduce the progress of pre-trained models and knowledge representation learning.", "Then we systematically categorize existing KEPTMs from three different perspectives.", "Finally, we outline some potential directions of KEPTMs for future research."]} {"id": "http://arxiv.org/abs/2110.09348", "title": "Understanding Dimensional Collapse in Contrastive Self-supervised Learning.", "authors": "Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian", "abstract": "Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. Joint embedding approach bases on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem where all embedding vectors collapse to a trivial constant solution. Among these methods, contrastive learning prevents collapse via negative sample pairs. It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space. Here, we show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that leads to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector. 
Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.", "sentences": ["Understanding Dimensional Collapse in Contrastive Self-supervised Learning.", "Self-supervised visual representation learning aims to learn useful representations without relying on human annotations.", "Joint embedding approach bases on maximizing the agreement between embedding vectors from different views of the same image.", "Various methods have been proposed to solve the collapsing problem where all embedding vectors collapse to a trivial constant solution.", "Among these methods, contrastive learning prevents collapse via negative sample pairs.", "It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space.", "Here, we show that dimensional collapse also happens in contrastive learning.", "In this paper, we shed light on the dynamics at play in contrastive learning that leads to dimensional collapse.", "Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector.", "Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet."]} {"id": "http://arxiv.org/abs/2111.01690", "title": "Recent Advances in End-to-End Automatic Speech Recognition.", "authors": "Jinyu Li", "abstract": "Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve the state-of-the-art results in most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems at the current time. There are lots of practical factors that affect the production model deployment decision. Traditional hybrid models, being optimized for production for decades, are usually good at these factors. Without providing excellent solutions to all these factors, it is hard for E2E models to be widely commercialized. 
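One quick diagnostic for the dimensional collapse discussed in the entry above is the singular value spectrum of the embedding matrix: collapsed embeddings show many near-zero singular values. A minimal sketch on synthetic embeddings fabricated to exhibit the effect (not taken from any model):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 1000, 128, 16  # n embeddings of dim d, but only k effective dims

# Synthetic "collapsed" embeddings: they span only a k-dim subspace of R^d.
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]
z = rng.normal(size=(n, k)) @ basis.T

# Spectrum of the centered embeddings; near-zero values reveal collapsed directions.
z_centered = z - z.mean(axis=0)
sv = np.linalg.svd(z_centered, compute_uv=False)
print("effective rank (sv > 1e-8):", int((sv > 1e-8).sum()))  # ~16, not 128
```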
In this paper, we provide an overview of recent advances in E2E models, focusing on technologies addressing those challenges from the industry's perspective.", "sentences": ["Recent Advances in End-to-End Automatic Speech Recognition.", "Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR).", "While E2E models achieve the state-of-the-art results in most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems at the current time.", "There are lots of practical factors that affect the production model deployment decision.", "Traditional hybrid models, being optimized for production for decades, are usually good at these factors.", "Without providing excellent solutions to all these factors, it is hard for E2E models to be widely commercialized.", "In this paper, we provide an overview of recent advances in E2E models, focusing on technologies addressing those challenges from the industry's perspective."]} {"id": "http://arxiv.org/abs/2111.15208", "title": "HRNET: AI on Edge for mask detection and social distancing.", "authors": "Kinshuk Sengupta, Praveen Ranjan Srivastava", "abstract": "The purpose of this paper is to provide an innovative emerging-technology framework that helps communities combat epidemic situations. The paper proposes a unique outbreak response system framework based on artificial intelligence and edge computing for citizen-centric services to help track and trace people eluding safety policies such as mask wearing and social distancing in public or workplace settings. The framework further provides implementation guidelines for industrial settings as well as for governance and contact-tracing tasks. Its adoption can thus support smart-city planning and development focused on citizen health systems, contributing to improved quality of life. The conceptual framework presented is validated through quantitative analysis of secondary data collected from researchers' public websites, GitHub repositories, and established journals; further benchmarking experiments were conducted in a Microsoft Azure cloud environment. The study benchmarks selected AI models, assessing their performance and accuracy in an edge-computing environment for large-scale societal deployment. Overall, the YOLO model performs best on the object-detection task and is fast enough for mask detection, while HRNetV2 performs best on the semantic-segmentation problem used to address the social-distancing task in an AI-edge inferencing setup. The paper proposes a new edge-AI algorithm for building technology-oriented solutions that detect mask wearing and social distancing in human movement. The paper contributes to technological advancement in artificial intelligence and edge computing applied to problems in society and healthcare systems.
The framework further equips government agencies and system providers to design and construct technology-oriented models in community settings, increasing quality of life by bringing emerging technologies into smart urban environments.", "sentences": ["HRNET: AI on Edge for mask detection and social distancing.", "The purpose of this paper is to provide an innovative emerging-technology framework that helps communities combat epidemic situations.", "The paper proposes a unique outbreak response system framework based on artificial intelligence and edge computing for citizen-centric services to help track and trace people eluding safety policies such as mask wearing and social distancing in public or workplace settings.", "The framework further provides implementation guidelines for industrial settings as well as for governance and contact-tracing tasks.", "Its adoption can thus support smart-city planning and development focused on citizen health systems, contributing to improved quality of life.", "The conceptual framework presented is validated through quantitative analysis of secondary data collected from researchers' public websites, GitHub repositories, and established journals; further benchmarking experiments were conducted in a Microsoft Azure cloud environment.", "The study benchmarks selected AI models, assessing their performance and accuracy in an edge-computing environment for large-scale societal deployment.", "Overall, the YOLO model performs best on the object-detection task and is fast enough for mask detection, while HRNetV2 performs best on the semantic-segmentation problem used to address the social-distancing task in an AI-edge inferencing setup.", "The paper proposes a new edge-AI algorithm for building technology-oriented solutions that detect mask wearing and social distancing in human movement.", "The paper contributes to technological advancement in artificial intelligence and edge computing applied to problems in society and healthcare systems.", "The framework further equips government agencies and system providers to design and construct technology-oriented models in community settings, increasing quality of life by bringing emerging technologies into smart urban environments."]} {"id": "http://arxiv.org/abs/2201.10328", "title": "ML4CO-KIDA: Knowledge Inheritance in Dataset Aggregation.", "authors": "Zixuan Cao, Yang Xu, Zhewei Huang, Shuchang Zhou", "abstract": "The Machine Learning for Combinatorial Optimization (ML4CO) NeurIPS 2021 competition aims to improve state-of-the-art combinatorial optimization solvers by replacing key heuristic components with machine learning models. On the dual task, we design models to make branching decisions that promote faster increases in the dual bound. We propose a knowledge inheritance method, named KIDA, to generalize the knowledge of different models from the dataset aggregation process. Our improvement overcomes some defects of the baseline graph-neural-networks-based methods. Further, we won the $1$\\textsuperscript{st} place on the dual task. We hope this report can provide useful experience for developers and researchers.
The code is available at https://github.com/megvii-research/NeurIPS2021-ML4CO-KIDA.", "sentences": ["ML4CO-KIDA: Knowledge Inheritance in Dataset Aggregation.", "The Machine Learning for Combinatorial Optimization (ML4CO) NeurIPS 2021 competition aims to improve state-of-the-art combinatorial optimization solvers by replacing key heuristic components with machine learning models.", "On the dual task, we design models to make branching decisions that promote faster increases in the dual bound.", "We propose a knowledge inheritance method, named KIDA, to generalize the knowledge of different models from the dataset aggregation process.", "Our improvement overcomes some defects of the baseline graph-neural-networks-based methods.", "Further, we won the $1$\\textsuperscript{st} place on the dual task.", "We hope this report can provide useful experience for developers and researchers.", "The code is available at https://github.com/megvii-research/NeurIPS2021-ML4CO-KIDA."]} {"id": "http://arxiv.org/abs/2201.11410", "title": "Reinforcement Learning-Empowered Mobile Edge Computing for 6G Edge Intelligence.", "authors": "Peng Wei, Kun Guo, Ye Li, Jue Wang, Wei Feng, Shi Jin, Ning Ge, Ying-Chang Liang", "abstract": "Mobile edge computing (MEC) is considered a novel paradigm for computation-intensive and delay-sensitive tasks in fifth generation (5G) networks and beyond. However, its uncertainty, namely the dynamics and randomness arising from the mobile device, wireless channel, and edge network sides, results in high-dimensional, nonconvex, nonlinear, and NP-hard optimization problems. Thanks to reinforcement learning (RL), an agent trained by iteratively interacting with this dynamic and random environment can intelligently obtain the optimal policy in MEC. Furthermore, evolved versions of RL, such as deep RL (DRL), can achieve faster convergence and higher learning accuracy by parametrically approximating the large-scale state-action space. This paper provides a comprehensive research review on RL-enabled MEC and offers insight for development in this area. More importantly, the MEC challenges associated with free mobility, dynamic channels, and distributed services that can be solved by different kinds of RL algorithms are identified, followed by how RL solutions address them in diverse mobile applications.
Finally, open challenges are discussed to provide helpful guidance for future research on RL training and learning for MEC.", "sentences": ["Reinforcement Learning-Empowered Mobile Edge Computing for 6G Edge Intelligence.", "Mobile edge computing (MEC) is considered a novel paradigm for computation-intensive and delay-sensitive tasks in fifth generation (5G) networks and beyond.", "However, its uncertainty, namely the dynamics and randomness arising from the mobile device, wireless channel, and edge network sides, results in high-dimensional, nonconvex, nonlinear, and NP-hard optimization problems.", "Thanks to reinforcement learning (RL), an agent trained by iteratively interacting with this dynamic and random environment can intelligently obtain the optimal policy in MEC.", "Furthermore, evolved versions of RL, such as deep RL (DRL), can achieve faster convergence and higher learning accuracy by parametrically approximating the large-scale state-action space.", "This paper provides a comprehensive research review on RL-enabled MEC and offers insight for development in this area.", "More importantly, the MEC challenges associated with free mobility, dynamic channels, and distributed services that can be solved by different kinds of RL algorithms are identified, followed by how RL solutions address them in diverse mobile applications.", "Finally, open challenges are discussed to provide helpful guidance for future research on RL training and learning for MEC."]} {"id": "http://arxiv.org/abs/2201.11650", "title": "Incremental Mining of Frequent Serial Episodes Considering Multiple Occurrence.", "authors": "Thomas Guyet, Wenbin Zhang, Albert Bifet", "abstract": "The need to analyze information from streams arises in a variety of applications. One of the fundamental research directions is to mine sequential patterns over data streams. Current studies mine series of items based on the existence of the pattern in transactions but pay no attention to the series of itemsets and their multiple occurrences. The pattern over a window of itemsets stream and their multiple occurrences, however, provides additional capability to recognize the essential characteristics of the patterns and the inter-relationships among them that are unidentifiable by the existing items and existence based studies. In this paper, we study such a new sequential pattern mining problem and propose a corresponding efficient sequential miner with novel strategies to prune search space efficiently.
Experiments on both real and synthetic data show the utility of our approach.", "sentences": ["Incremental Mining of Frequent Serial Episodes Considering Multiple Occurrence.", "The need to analyze information from streams arises in a variety of applications.", "One of the fundamental research directions is to mine sequential patterns over data streams.", "Current studies mine series of items based on the existence of the pattern in transactions but pay no attention to the series of itemsets and their multiple occurrences.", "The pattern over a window of itemsets stream and their multiple occurrences, however, provides additional capability to recognize the essential characteristics of the patterns and the inter-relationships among them that are unidentifiable by the existing items and existence based studies.", "In this paper, we study such a new sequential pattern mining problem and propose a corresponding efficient sequential miner with novel strategies to prune search space efficiently.", "Experiments on both real and synthetic data show the utility of our approach."]} {"id": "http://arxiv.org/abs/2201.12855", "title": "Augmented Business Process Management Systems: A Research Manifesto.", "authors": "Marlon Dumas, Fabiana Fournier, Lior Limonad, Andrea Marrella, Marco Montali, Jana-Rebecca Rehse, Rafael Accorsi, Diego Calvanese, Giuseppe De Giacomo, Dirk Fahland, Avigdor Gal, Marcello La Rosa, Hagen V\u00f6lzer, Ingo Weber", "abstract": "Augmented Business Process Management Systems (ABPMSs) are an emerging class of process-aware information systems that draws upon trustworthy AI technology. An ABPMS enhances the execution of business processes with the aim of making these processes more adaptable, proactive, explainable, and context-sensitive. This manifesto presents a vision for ABPMSs and discusses research challenges that need to be surmounted to realize this vision. To this end, we define the concept of ABPMS, we outline the lifecycle of processes within an ABPMS, we discuss core characteristics of an ABPMS, and we derive a set of challenges to realize systems with these characteristics.", "sentences": ["Augmented Business Process Management Systems: A Research Manifesto.", "Augmented Business Process Management Systems (ABPMSs) are an emerging class of process-aware information systems that draws upon trustworthy AI technology.", "An ABPMS enhances the execution of business processes with the aim of making these processes more adaptable, proactive, explainable, and context-sensitive.", "This manifesto presents a vision for ABPMSs and discusses research challenges that need to be surmounted to realize this vision.", "To this end, we define the concept of ABPMS, we outline the lifecycle of processes within an ABPMS, we discuss core characteristics of an ABPMS, and we derive a set of challenges to realize systems with these characteristics."]} {"id": "http://arxiv.org/abs/2201.13195", "title": "Memory-Efficient Backpropagation through Large Linear Layers.", "authors": "Daniel Bershatsky, Aleksandr Mikhalev, Alexandr Katrutsa, Julia Gusak, Daniil Merkulov, Ivan Oseledets", "abstract": "In modern neural networks like Transformers, linear layers require significant memory to store activations during backward pass. This study proposes a memory reduction approach to perform backpropagation through linear layers. 
Since the gradients of linear layers are computed by matrix multiplications, we consider methods for randomized matrix multiplications and demonstrate that they require less memory with a moderate decrease of the test accuracy. Also, we investigate the variance of the gradient estimate induced by the randomized matrix multiplication. We compare this variance with the variance coming from gradient estimation based on the batch of samples. We demonstrate the benefits of the proposed method on the fine-tuning of the pre-trained RoBERTa model on GLUE tasks.", "sentences": ["Memory-Efficient Backpropagation through Large Linear Layers.", "In modern neural networks like Transformers, linear layers require significant memory to store activations during backward pass.", "This study proposes a memory reduction approach to perform backpropagation through linear layers.", "Since the gradients of linear layers are computed by matrix multiplications, we consider methods for randomized matrix multiplications and demonstrate that they require less memory with a moderate decrease of the test accuracy.", "Also, we investigate the variance of the gradient estimate induced by the randomized matrix multiplication.", "We compare this variance with the variance coming from gradient estimation based on the batch of samples.", "We demonstrate the benefits of the proposed method on the fine-tuning of the pre-trained RoBERTa model on GLUE tasks."]} {"id": "http://arxiv.org/abs/2202.00063", "title": "Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach.", "authors": "Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, Wen Sun", "abstract": "We present BRIEE (Block-structured Representation learning with Interleaved Explore Exploit), an algorithm for efficient reinforcement learning in Markov Decision Processes with block-structured dynamics (i.e., Block MDPs), where rich observations are generated from a set of unknown latent states. BRIEE interleaves latent states discovery, exploration, and exploitation together, and can provably learn a near-optimal policy with sample complexity scaling polynomially in the number of latent states, actions, and the time horizon, with no dependence on the size of the potentially infinite observation space. 
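To make the memory-efficient backpropagation entry concrete: a linear layer's weight gradient is X^T G, with X the saved input and G the output gradient, so a randomized matrix multiplication over the batch dimension lets one keep only a sample of rows. The importance-sampled estimator below is a generic sketch, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
b, m, n = 512, 256, 128
X = rng.normal(size=(b, m))   # layer input saved for backward
G = rng.normal(size=(b, n))   # gradient w.r.t. layer output

def sampled_matmul(X, G, k, rng):
    """Unbiased estimate of X.T @ G using k sampled batch rows.
    Rows are drawn with probability proportional to ||X_i||*||G_i||
    and rescaled by 1/(k*p_i) to keep the estimator unbiased."""
    w = np.linalg.norm(X, axis=1) * np.linalg.norm(G, axis=1)
    p = w / w.sum()
    idx = rng.choice(b, size=k, replace=True, p=p)
    scale = 1.0 / (k * p[idx])
    return (X[idx] * scale[:, None]).T @ G[idx]

exact = X.T @ G
approx = sampled_matmul(X, G, k=128, rng=rng)
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative error with 128 of {b} rows: {rel_err:.3f}")
```

The 1/(k*p_i) rescaling makes each sampled outer product an unbiased estimate of the full sum, which is the standard variance-reduction trick for sampled matrix products.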
Empirically, we show that BRIEE is more sample efficient than the state-of-art Block MDP algorithm HOMER and other empirical RL baselines on challenging rich-observation combination lock problems that require deep exploration.", "sentences": ["Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach.", "We present BRIEE (Block-structured Representation learning with Interleaved Explore Exploit), an algorithm for efficient reinforcement learning in Markov Decision Processes with block-structured dynamics (i.e., Block MDPs), where rich observations are generated from a set of unknown latent states.", "BRIEE interleaves latent states discovery, exploration, and exploitation together, and can provably learn a near-optimal policy with sample complexity scaling polynomially in the number of latent states, actions, and the time horizon, with no dependence on the size of the potentially infinite observation space.", "Empirically, we show that BRIEE is more sample efficient than the state-of-art Block MDP algorithm HOMER and other empirical RL baselines on challenging rich-observation combination lock problems that require deep exploration."]} {"id": "http://arxiv.org/abs/2202.00441", "title": "Few-Bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction.", "authors": "Georgii Novikov, Daniel Bershatsky, Julia Gusak, Alex Shonenkov, Denis Dimitrov, Ivan Oseledets", "abstract": "Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and such operation induces additional memory costs which -- as we show -- can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per each element. We show that such approximation can be achieved by computing optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. 
We confirm the memory reduction and the same convergence on several open benchmarks.", "sentences": ["Few-Bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction.", "Memory footprint is one of the main limiting factors for large neural network training.", "In backpropagation, one needs to store the input to each operation in the computational graph.", "Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and such operation induces additional memory costs which -- as we show -- can be significantly reduced by quantization of the gradients.", "We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per each element.", "We show that such approximation can be achieved by computing optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming.", "The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline.", "We confirm the memory reduction and the same convergence on several open benchmarks."]} {"id": "http://arxiv.org/abs/2202.00448", "title": "Sim2Real Object-Centric Keypoint Detection and Description.", "authors": "Chengliang Zhong, Chao Yang, Jinshan Qi, Fuchun Sun, Huaping Liu, Xiaodong Mu, Wenbing Huang", "abstract": "Keypoint detection and description play a central role in computer vision. Most existing methods are in the form of scene-level prediction, without returning the object classes of different keypoints. In this paper, we propose the object-centric formulation, which, beyond the conventional setting, requires further identifying which object each interest point belongs to. With such fine-grained information, our framework enables more downstream potentials, such as object-level matching and pose estimation in a clustered environment. To get around the difficulty of label collection in the real world, we develop a sim2real contrastive learning mechanism that can generalize the model trained in simulation to real-world applications. The novelties of our training method are three-fold: (i) we integrate the uncertainty into the learning framework to improve feature description of hard cases, e.g., less-textured or symmetric patches; (ii) we decouple the object descriptor into two output branches -- intra-object salience and inter-object distinctness, resulting in a better pixel-wise description; (iii) we enforce cross-view semantic consistency for enhanced robustness in representation learning. Comprehensive experiments on image matching and 6D pose estimation verify the encouraging generalization ability of our method from simulation to reality. Particularly for 6D pose estimation, our method significantly outperforms typical unsupervised/sim2real methods, achieving a closer gap with the fully supervised counterpart. 
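To illustrate the Few-Bit Backward entry: the derivative of a pointwise nonlinearity is replaced by a piecewise-constant approximation, so the forward pass stores only a few bits per element for backward. The sketch below uses hand-chosen breakpoints for a 2-bit tanh derivative; the paper instead derives optimal levels via dynamic programming:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)      # pre-activations that backward would need

def dtanh(v):
    return 1.0 - np.tanh(v) ** 2  # exact derivative of the nonlinearity

# 2 bits per element: 4 bins. Instead of storing x (32 bits), forward stores
# only each element's bin id; backward looks up one constant per bin.
breaks = np.array([-1.0, 0.0, 1.0])  # hand-chosen, not the paper's optimal ones
bins = np.digitize(x, breaks)        # values in {0,1,2,3} -> 2 bits
levels = np.array([dtanh(x[bins == b]).mean() for b in range(4)])

g_exact = dtanh(x)      # what full-precision backward would use
g_quant = levels[bins]  # what 2-bit backward uses
print(f"mean absolute derivative error at 2 bits: "
      f"{np.abs(g_quant - g_exact).mean():.4f}")
```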
Additional results and videos can be found at https://zhongcl-thu.github.io/rock/", "sentences": ["Sim2Real Object-Centric Keypoint Detection and Description.", "Keypoint detection and description play a central role in computer vision.", "Most existing methods are in the form of scene-level prediction, without returning the object classes of different keypoints.", "In this paper, we propose the object-centric formulation, which, beyond the conventional setting, requires further identifying which object each interest point belongs to.", "With such fine-grained information, our framework enables more downstream potentials, such as object-level matching and pose estimation in a clustered environment.", "To get around the difficulty of label collection in the real world, we develop a sim2real contrastive learning mechanism that can generalize the model trained in simulation to real-world applications.", "The novelties of our training method are three-fold: (i) we integrate the uncertainty into the learning framework to improve feature description of hard cases, e.g., less-textured or symmetric patches; (ii) we decouple the object descriptor into two output branches -- intra-object salience and inter-object distinctness, resulting in a better pixel-wise description; (iii) we enforce cross-view semantic consistency for enhanced robustness in representation learning.", "Comprehensive experiments on image matching and 6D pose estimation verify the encouraging generalization ability of our method from simulation to reality.", "Particularly for 6D pose estimation, our method significantly outperforms typical unsupervised/sim2real methods, achieving a closer gap with the fully supervised counterpart.", "Additional results and videos can be found at https://zhongcl-thu.github.io/rock/"]} {"id": "http://arxiv.org/abs/2202.00674", "title": "Just Another Method to Compute MTTF from Continuous Time Markov Chain.", "authors": "Eduardo M. Vasconcelos", "abstract": "The mean time to failure (MTTF) is a statistic that measures how much time a system takes to enter one of its absorbing states. This statistic can be used in most areas of knowledge. In engineering, for example, it can be used as a measure of equipment reliability, and in business, as a measure of process performance. This work presents a method to obtain the mean time to failure from Continuous Time Markov Chain models. The method is intuitive and simple to implement, since it consists of solving a system of linear equations.", "sentences": ["Just Another Method to Compute MTTF from Continuous Time Markov Chain.", "The mean time to failure (MTTF) is a statistic that measures how much time a system takes to enter one of its absorbing states.", "This statistic can be used in most areas of knowledge.", "In engineering, for example, it can be used as a measure of equipment reliability, and in business, as a measure of process performance.", "This work presents a method to obtain the mean time to failure from Continuous Time Markov Chain models.", "The method is intuitive and simple to implement, since it consists of solving a system of linear equations."]} {"id": "http://arxiv.org/abs/2202.00677", "title": "An Embarrassingly Simple Consistency Regularization Method for Semi-Supervised Medical Image Segmentation.", "authors": "Hritam Basak, Rajarshi Bhattacharya, Rukhshanda Hussain, Agniv Chatterjee", "abstract": "The scarcity of pixel-level annotation is a prevalent problem in medical image segmentation tasks.
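The MTTF entry above boils down to one linear solve: restrict the CTMC generator Q to the transient states and solve Q_TT t = -1, where t holds the expected absorption times. A small worked sketch with a made-up four-state chain (the rate values are illustrative only):

```python
import numpy as np

# Generator matrix Q of a 4-state CTMC; state 3 is absorbing ("failed").
# Off-diagonals are transition rates; diagonals make each row sum to zero.
Q = np.array([
    [-0.7,  0.5,  0.2,  0.0],
    [ 0.3, -0.9,  0.4,  0.2],
    [ 0.1,  0.2, -0.4,  0.1],
    [ 0.0,  0.0,  0.0,  0.0],   # absorbing: no outgoing rates
])

transient = [0, 1, 2]
Q_TT = Q[np.ix_(transient, transient)]

# Expected times to absorption t satisfy Q_TT @ t = -1 (one equation per
# transient state), so a single linear solve gives the MTTF from any start.
t = np.linalg.solve(Q_TT, -np.ones(len(transient)))
print("MTTF from each transient state:", t)
print("MTTF starting in state 0:", t[0])
```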
In this paper, we introduce a novel regularization strategy involving interpolation-based mixing for semi-supervised medical image segmentation. The proposed method is a new consistency regularization strategy that encourages segmentation of interpolation of two unlabelled data to be consistent with the interpolation of segmentation maps of those data. This method represents a specific type of data-adaptive regularization paradigm which aids to minimize the overfitting of labelled data under high confidence values. The proposed method is advantageous over adversarial and generative models as it requires no additional computation. Upon evaluation on two publicly available MRI datasets: ACDC and MMWHS, experimental results demonstrate the superiority of the proposed method in comparison to existing semi-supervised models. Code is available at: https://github.com/hritam-98/ICT-MedSeg", "sentences": ["An Embarrassingly Simple Consistency Regularization Method for Semi-Supervised Medical Image Segmentation.", "The scarcity of pixel-level annotation is a prevalent problem in medical image segmentation tasks.", "In this paper, we introduce a novel regularization strategy involving interpolation-based mixing for semi-supervised medical image segmentation.", "The proposed method is a new consistency regularization strategy that encourages segmentation of interpolation of two unlabelled data to be consistent with the interpolation of segmentation maps of those data.", "This method represents a specific type of data-adaptive regularization paradigm which aids to minimize the overfitting of labelled data under high confidence values.", "The proposed method is advantageous over adversarial and generative models as it requires no additional computation.", "Upon evaluation on two publicly available MRI datasets: ACDC and MMWHS, experimental results demonstrate the superiority of the proposed method in comparison to existing semi-supervised models.", "Code is available at: https://github.com/hritam-98/ICT-MedSeg"]} {"id": "http://arxiv.org/abs/2202.00834", "title": "Algorithms for Efficiently Learning Low-Rank Neural Networks.", "authors": "Kiran Vodrahalli, Rakesh Shivanna, Maheswaran Sathiamoorthy, Sagar Jain, Ed H. Chi", "abstract": "We study algorithms for learning low-rank neural networks -- networks where the weight parameters are re-parameterized by products of two low-rank matrices. First, we present a provably efficient algorithm which learns an optimal low-rank approximation to a single-hidden-layer ReLU network up to additive error $\\epsilon$ with probability $\\ge 1 - \\delta$, given access to noiseless samples with Gaussian marginals in polynomial time and samples. Thus, we provide the first example of an algorithm which can efficiently learn a neural network up to additive error without assuming the ground truth is realizable. To solve this problem, we introduce an efficient SVD-based $\\textit{Nonlinear Kernel Projection}$ algorithm for solving a nonlinear low-rank approximation problem over Gaussian space. Inspired by the efficiency of our algorithm, we propose a novel low-rank initialization framework for training low-rank $\\textit{deep}$ networks, and prove that for ReLU networks, the gap between our method and existing schemes widens as the desired rank of the approximating weights decreases, or as the dimension of the inputs increases (the latter point holds when network width is superlinear in dimension). 
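The consistency objective in the semi-supervised segmentation entry above fits in a few lines: the prediction on an interpolation of two unlabelled inputs is pushed toward the same interpolation of their individual predictions. A toy sketch with a fixed random network standing in for the segmentation model (shapes and the two-layer net are assumptions); in the full method the targets would typically come from a teacher whose outputs are not backpropagated through:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy stand-in for a segmentation network: flattened "image" -> per-pixel scores.
W1 = rng.normal(size=(64, 32)) * 0.2
W2 = rng.normal(size=(32, 64)) * 0.2
def f(x):
    return np.tanh(x @ W1) @ W2

x1, x2 = rng.normal(size=64), rng.normal(size=64)  # two unlabelled samples
lam = rng.beta(0.5, 0.5)                           # mixup coefficient

pred_mixed = f(lam * x1 + (1 - lam) * x2)          # prediction on the mix
target = lam * f(x1) + (1 - lam) * f(x2)           # mix of the predictions

consistency_loss = np.mean((pred_mixed - target) ** 2)
print(f"lambda={lam:.2f}  consistency loss={consistency_loss:.4f}")
```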
Finally, we validate our theory by training ResNet and EfficientNet models on ImageNet.", "sentences": ["Algorithms for Efficiently Learning Low-Rank Neural Networks.", "We study algorithms for learning low-rank neural networks -- networks where the weight parameters are re-parameterized by products of two low-rank matrices.", "First, we present a provably efficient algorithm which learns an optimal low-rank approximation to a single-hidden-layer ReLU network up to additive error $\\epsilon$ with probability $\\ge 1 - \\delta$, given access to noiseless samples with Gaussian marginals in polynomial time and samples.", "Thus, we provide the first example of an algorithm which can efficiently learn a neural network up to additive error without assuming the ground truth is realizable.", "To solve this problem, we introduce an efficient SVD-based $\\textit{Nonlinear Kernel Projection}$ algorithm for solving a nonlinear low-rank approximation problem over Gaussian space.", "Inspired by the efficiency of our algorithm, we propose a novel low-rank initialization framework for training low-rank $\\textit{deep}$ networks, and prove that for ReLU networks, the gap between our method and existing schemes widens as the desired rank of the approximating weights decreases, or as the dimension of the inputs increases (the latter point holds when network width is superlinear in dimension).", "Finally, we validate our theory by training ResNet and EfficientNet models on ImageNet."]} {"id": "http://arxiv.org/abs/2202.00941", "title": "CTMSTOU driven markets: simulated environment for regime-awareness in trading policies.", "authors": "Selim Amrouni, Aymeric Moulin, Tucker Balch", "abstract": "Market regimes are a popular topic in quantitative finance even though there is little consensus on the details of how they should be defined. They arise as a feature both in financial market prediction problems and in financial market task-performing problems. In this work we use a discrete event time multi-agent market simulation to freely experiment in a reproducible and understandable environment where regimes can be explicitly switched and enforced. We introduce a novel stochastic process to model the fundamental value perceived by market participants: Continuous-Time Markov Switching Trending Ornstein-Uhlenbeck (CTMSTOU), which facilitates the study of trading policies in regime switching markets.
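For the low-rank networks entry: re-parameterizing a weight matrix as a product of two thin factors cuts both parameters and compute, and a truncated SVD of a pretrained weight is a natural baseline initialization. The sketch below shows that baseline, not the paper's Nonlinear Kernel Projection algorithm:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, r = 512, 256, 32           # full layer is m x n; rank budget r
W = rng.normal(size=(m, n))      # pretrained dense weight (stand-in)

# Factor W ~= U @ V with U: (m, r), V: (r, n) via truncated SVD.
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :r] * s[:r]        # absorb singular values into U
V = Vt[:r]

params_full, params_lr = m * n, r * (m + n)
print(f"params: {params_full} -> {params_lr} "
      f"({params_lr / params_full:.1%} of dense)")

# Forward pass uses two thin matmuls instead of one dense one.
x = rng.normal(size=(8, m))
rel = np.linalg.norm(x @ U @ V - x @ W) / np.linalg.norm(x @ W)
print(f"relative output error at rank {r}: {rel:.3f}")
```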
We also define the notion of regime-awareness for a trading agent and illustrate its importance through the study of different order placement strategies in the context of order execution problems.", "sentences": ["CTMSTOU driven markets: simulated environment for regime-awareness in trading policies.", "Market regimes are a popular topic in quantitative finance even though there is little consensus on the details of how they should be defined.", "They arise as a feature both in financial market prediction problems and in financial market task-performing problems.", "In this work we use a discrete event time multi-agent market simulation to freely experiment in a reproducible and understandable environment where regimes can be explicitly switched and enforced.", "We introduce a novel stochastic process to model the fundamental value perceived by market participants: Continuous-Time Markov Switching Trending Ornstein-Uhlenbeck (CTMSTOU), which facilitates the study of trading policies in regime switching markets.", "We also define the notion of regime-awareness for a trading agent and illustrate its importance through the study of different order placement strategies in the context of order execution problems."]} {"id": "http://arxiv.org/abs/2202.01011", "title": "Auto-Transfer: Learning to Route Transferrable Representations.", "authors": "Keerthiram Murugesan, Vijay Sadashivaiah, Ronny Luss, Karthikeyan Shanmugam, Pin-Yu Chen, Amit Dhurandhar", "abstract": "Knowledge transfer between heterogeneous source and target networks and tasks has received a lot of attention in recent times as large amounts of quality labelled data can be difficult to obtain in many applications. Existing approaches typically constrain the target deep neural network (DNN) feature representations to be close to the source DNNs feature representations, which can be limiting. We, in this paper, propose a novel adversarial multi-armed bandit approach which automatically learns to route source representations to appropriate target representations following which they are combined in meaningful ways to produce accurate target models. We see upwards of 5% accuracy improvements compared with the state-of-the-art knowledge transfer methods on four benchmark (target) image datasets CUB200, Stanford Dogs, MIT67, and Stanford40 where the source dataset is ImageNet. We qualitatively analyze the goodness of our transfer scheme by showing individual examples of the important features our target network focuses on in different layers compared with the (closest) competitors.
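A rough Euler-Maruyama sketch of the regime-switching fundamental value described in the CTMSTOU entry: a two-state continuous-time Markov chain switches the trend of the mean that an Ornstein-Uhlenbeck process reverts to. Parameter values and the exact drift form are simplifying assumptions, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(8)
dt, n_steps = 1e-3, 20_000
theta, sigma = 2.0, 0.3          # OU mean-reversion speed and volatility
trend = {0: +0.5, 1: -0.5}       # per-regime trend of the fundamental mean
switch_rate = {0: 0.2, 1: 0.2}   # CTMC rate of leaving each regime

x, mu, regime = 100.0, 100.0, 0
path = np.empty(n_steps)
for t in range(n_steps):
    # A CTMC switch occurs with probability ~rate*dt in each small step.
    if rng.random() < switch_rate[regime] * dt:
        regime = 1 - regime
    mu += trend[regime] * dt                         # trending mean
    x += theta * (mu - x) * dt + sigma * np.sqrt(dt) * rng.normal()
    path[t] = x

print(f"final regime={regime}, final value={path[-1]:.2f}")
```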
We also observe that our improvement over other methods is higher for smaller target datasets making it an effective tool for small data applications that may benefit from transfer learning.", "sentences": ["Auto-Transfer: Learning to Route Transferrable Representations.", "Knowledge transfer between heterogeneous source and target networks and tasks has received a lot of attention in recent times as large amounts of quality labelled data can be difficult to obtain in many applications.", "Existing approaches typically constrain the target deep neural network (DNN) feature representations to be close to the source DNNs feature representations, which can be limiting.", "We, in this paper, propose a novel adversarial multi-armed bandit approach which automatically learns to route source representations to appropriate target representations following which they are combined in meaningful ways to produce accurate target models.", "We see upwards of 5% accuracy improvements compared with the state-of-the-art knowledge transfer methods on four benchmark (target) image datasets CUB200, Stanford Dogs, MIT67, and Stanford40 where the source dataset is ImageNet.", "We qualitatively analyze the goodness of our transfer scheme by showing individual examples of the important features our target network focuses on in different layers compared with the (closest) competitors.", "We also observe that our improvement over other methods is higher for smaller target datasets making it an effective tool for small data applications that may benefit from transfer learning."]} {"id": "http://arxiv.org/abs/2202.01211", "title": "A Flexible Clustering Pipeline for Mining Text Intentions.", "authors": "Xinyu Chen, Ian Beaver", "abstract": "Mining the latent intentions from large volumes of natural language inputs is a key step to help data analysts design and refine Intelligent Virtual Assistants (IVAs) for customer service and sales support. We created a flexible and scalable clustering pipeline within the Verint Intent Manager (VIM) that integrates the fine-tuning of language models, a high performing k-NN library and community detection techniques to help analysts quickly surface and organize relevant user intentions from conversational texts. The fine-tuning step is necessary because pre-trained language models cannot encode texts to efficiently surface particular clustering structures when the target texts are from an unseen domain or the clustering task is not topic detection. We describe the pipeline and demonstrate its performance using BERT on three real-world text mining tasks. 
As deployed in the VIM application, this clustering pipeline produces high quality results, improving the performance of data analysts and reducing the time it takes to surface intentions from customer service data, thereby reducing the time it takes to build and deploy IVAs in new domains.", "sentences": ["A Flexible Clustering Pipeline for Mining Text Intentions.", "Mining the latent intentions from large volumes of natural language inputs is a key step to help data analysts design and refine Intelligent Virtual Assistants (IVAs) for customer service and sales support.", "We created a flexible and scalable clustering pipeline within the Verint Intent Manager (VIM) that integrates the fine-tuning of language models, a high performing k-NN library and community detection techniques to help analysts quickly surface and organize relevant user intentions from conversational texts.", "The fine-tuning step is necessary because pre-trained language models cannot encode texts to efficiently surface particular clustering structures when the target texts are from an unseen domain or the clustering task is not topic detection.", "We describe the pipeline and demonstrate its performance using BERT on three real-world text mining tasks.", "As deployed in the VIM application, this clustering pipeline produces high quality results, improving the performance of data analysts and reducing the time it takes to surface intentions from customer service data, thereby reducing the time it takes to build and deploy IVAs in new domains."]} {"id": "http://arxiv.org/abs/2202.01279", "title": "PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts.", "authors": "Stephen H. Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-David, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Alan Fries, Maged S. Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Xiangru Tang, Mike Tian-Jian Jiang, Alexander M. Rush", "abstract": "PromptSource is a system for creating, sharing, and using natural language prompts. Prompts are functions that map an example from a dataset to a natural language input and target output. Using prompts to train and query language models is an emerging area in NLP that requires new tools that let users develop and refine these prompts collaboratively. PromptSource addresses the emergent challenges in this new setting with (1) a templating language for defining data-linked prompts, (2) an interface that lets users quickly iterate on prompt development by observing outputs of their prompts on many examples, and (3) a community-driven set of guidelines for contributing new prompts to a common pool. Over 2,000 prompts for roughly 170 datasets are already available in PromptSource. 
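The intent-mining entry above chains three generic steps: encode texts, build a k-NN similarity graph, and detect communities. A dependency-light sketch with random vectors standing in for fine-tuned language-model embeddings and connected components standing in for a proper community-detection algorithm (both substitutions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, k = 60, 16, 5

# Stand-in for fine-tuned LM embeddings: two synthetic "intent" clusters.
centers = rng.normal(size=(2, d))
emb = np.vstack([centers[0] + 0.2 * rng.normal(size=(n // 2, d)),
                 centers[1] + 0.2 * rng.normal(size=(n // 2, d))])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # cosine via dot products

sim = emb @ emb.T
np.fill_diagonal(sim, -np.inf)
knn = np.argsort(-sim, axis=1)[:, :k]              # k nearest neighbours

# Union-find over the k-NN edges: the simplest stand-in for the pipeline's
# community detection step.
parent = list(range(n))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

for i in range(n):
    for j in knn[i]:
        parent[find(i)] = find(int(j))

labels = [find(i) for i in range(n)]
print("intent clusters found:", len(set(labels)))  # expect 2
```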
PromptSource is available at https://github.com/bigscience-workshop/promptsource.", "sentences": ["PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts.", "PromptSource is a system for creating, sharing, and using natural language prompts.", "Prompts are functions that map an example from a dataset to a natural language input and target output.", "Using prompts to train and query language models is an emerging area in NLP that requires new tools that let users develop and refine these prompts collaboratively.", "PromptSource addresses the emergent challenges in this new setting with (1) a templating language for defining data-linked prompts, (2) an interface that lets users quickly iterate on prompt development by observing outputs of their prompts on many examples, and (3) a community-driven set of guidelines for contributing new prompts to a common pool.", "Over 2,000 prompts for roughly 170 datasets are already available in PromptSource.", "PromptSource is available at https://github.com/bigscience-workshop/promptsource."]} {"id": "http://arxiv.org/abs/2202.01286", "title": "ASR-Aware End-to-end Neural Diarization.", "authors": "Aparna Khare, Eunjung Han, Yuguang Yang, Andreas Stolcke", "abstract": "We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model. Two categories of features are explored: features derived directly from ASR output (phones, position-in-word and word boundaries) and features derived from a lexical speaker change detection model, trained by fine-tuning a pretrained BERT model on the ASR output. Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features. First, ASR features are concatenated with acoustic features. Second, we propose a new attention mechanism called contextualized self-attention that utilizes ASR features to build robust speaker representations. Finally, multi-task learning is used to train the model to minimize classification loss for the ASR features along with diarization loss. 
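As the PromptSource entry describes, a prompt is a function from a dataset example to a natural-language input and target, written in a templating language (PromptSource itself uses Jinja-style templates). A minimal illustration with Python string formatting rather than the real templating engine; the example fields are invented:

```python
# A "prompt" maps a structured example to (input text, target text).
example = {"premise": "The sky is clear tonight.",
           "hypothesis": "It is going to rain.",
           "label": 2}
label_names = ["entailment", "neutral", "contradiction"]

input_template = ('Premise: "{premise}" Hypothesis: "{hypothesis}" '
                  "Does the premise entail the hypothesis?")

def apply_prompt(ex):
    x = input_template.format(**ex)
    y = label_names[ex["label"]]
    return x, y

x, y = apply_prompt(example)
print(x)
print("->", y)
```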
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features, reducing the diarization error rate (DER) by 20% relative to the baseline.", "sentences": ["ASR-Aware End-to-end Neural Diarization.", "We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.", "Two categories of features are explored: features derived directly from ASR output (phones, position-in-word and word boundaries) and features derived from a lexical speaker change detection model, trained by fine-tuning a pretrained BERT model on the ASR output.", "Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.", "First, ASR features are concatenated with acoustic features.", "Second, we propose a new attention mechanism called contextualized self-attention that utilizes ASR features to build robust speaker representations.", "Finally, multi-task learning is used to train the model to minimize classification loss for the ASR features along with diarization loss.", "Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features, reducing the diarization error rate (DER) by 20% relative to the baseline."]} {"id": "http://arxiv.org/abs/2202.01302", "title": "A Comparison of Online Hate on Reddit and 4chan: A Case Study of the 2020 US Election.", "authors": "Fatima Zahrah, Jason R. C. Nurse, Michael Goldsmith", "abstract": "The rapid integration of the Internet into our daily lives has led to many benefits but also to a number of new, wide-spread threats such as online hate, trolling, bullying, and generally aggressive behaviours. While research has traditionally explored online hate, in particular, on one platform, the reality is that such hate is a phenomenon that often makes use of multiple online networks. In this article, we seek to advance the discussion into online hate by harnessing a comparative approach, where we make use of various Natural Language Processing (NLP) techniques to computationally analyse hateful content from Reddit and 4chan relating to the 2020 US Presidential Elections. Our findings show how content and posting activity can differ depending on the platform being used. Through this, we provide initial comparison into the platform-specific behaviours of online hate, and how different platforms can serve specific purposes. 
We further provide several avenues for future research utilising a cross-platform approach so as to gain a more comprehensive understanding of the global hate ecosystem.", "sentences": ["A Comparison of Online Hate on Reddit and 4chan: A Case Study of the 2020 US Election.", "The rapid integration of the Internet into our daily lives has led to many benefits but also to a number of new, wide-spread threats such as online hate, trolling, bullying, and generally aggressive behaviours.", "While research has traditionally explored online hate, in particular, on one platform, the reality is that such hate is a phenomenon that often makes use of multiple online networks.", "In this article, we seek to advance the discussion into online hate by harnessing a comparative approach, where we make use of various Natural Language Processing (NLP) techniques to computationally analyse hateful content from Reddit and 4chan relating to the 2020 US Presidential Elections.", "Our findings show how content and posting activity can differ depending on the platform being used.", "Through this, we provide initial comparison into the platform-specific behaviours of online hate, and how different platforms can serve specific purposes.", "We further provide several avenues for future research utilising a cross-platform approach so as to gain a more comprehensive understanding of the global hate ecosystem."]} {"id": "http://arxiv.org/abs/2202.01338", "title": "Regression Transformer: Concurrent Conditional Generation and Regression by Blending Numerical and Textual Tokens.", "authors": "Jannis Born, Matteo Manica", "abstract": "We report the Regression Transformer (RT), a method that abstracts regression as a conditional sequence modeling problem. The RT casts continuous properties as sequences of numerical tokens and encodes them jointly with conventional tokens. This yields a dichotomous model that can seamlessly transition between solving regression tasks and conditional generation tasks; solely governed by the mask location. We propose several extensions to the XLNet objective and adopt an alternating training scheme to concurrently optimize property prediction and conditional text generation based on a self-consistency loss. Our experiments on both chemical and protein languages demonstrate that the performance of traditional regression models can be surpassed despite training with cross entropy loss. Importantly, priming the same model with continuous properties yields a highly competitive conditional generative models that outperforms specialized approaches in a constrained property optimization benchmark. In sum, the Regression Transformer opens the door for \"swiss army knife\" models that excel at both regression and conditional generation. 
This finds application particularly in property-driven, local exploration of the chemical or protein space.", "sentences": ["Regression Transformer: Concurrent Conditional Generation and Regression by Blending Numerical and Textual Tokens.", "We report the Regression Transformer (RT), a method that abstracts regression as a conditional sequence modeling problem.", "The RT casts continuous properties as sequences of numerical tokens and encodes them jointly with conventional tokens.", "This yields a dichotomous model that can seamlessly transition between solving regression tasks and conditional generation tasks; solely governed by the mask location.", "We propose several extensions to the XLNet objective and adopt an alternating training scheme to concurrently optimize property prediction and conditional text generation based on a self-consistency loss.", "Our experiments on both chemical and protein languages demonstrate that the performance of traditional regression models can be surpassed despite training with cross entropy loss.", "Importantly, priming the same model with continuous properties yields a highly competitive conditional generative model that outperforms specialized approaches in a constrained property optimization benchmark.", "In sum, the Regression Transformer opens the door for \"swiss army knife\" models that excel at both regression and conditional generation.", "This finds application particularly in property-driven, local exploration of the chemical or protein space."]} {"id": "http://arxiv.org/abs/2202.01374", "title": "mSLAM: Massively multilingual joint pre-training for speech and text.", "authors": "Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, Alexis Conneau", "abstract": "We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space. We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID while being competitive on multilingual ASR, when compared against speech-only pre-training. Our speech translation model demonstrates zero-shot text translation without seeing any text translation data, providing evidence for cross-modal alignment of representations. mSLAM also benefits from multi-modal fine-tuning, further improving the quality of speech translation by directly leveraging text translation data during the fine-tuning process.
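The Regression Transformer record above hinges on casting continuous properties as sequences of numerical tokens. The toy Python round-trip below illustrates the idea only; the paper's actual tokenization scheme is not reproduced here.

def to_numerical_tokens(value, precision=3):
    # Spell a float out digit by digit so a language model can read or generate it.
    return [f"_{ch}_" for ch in f"{value:.{precision}f}"]

def from_numerical_tokens(tokens):
    # Invert the encoding by stripping the token decoration and re-parsing.
    return float("".join(t.strip("_") for t in tokens))

tokens = to_numerical_tokens(0.724)   # ['_0_', '_._', '_7_', '_2_', '_4_']
assert from_numerical_tokens(tokens) == 0.724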
Our empirical analysis highlights several opportunities and challenges arising from large-scale multimodal pre-training, suggesting directions for future research.", "sentences": ["mSLAM: Massively multilingual joint pre-training for speech and text.", "We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages.", "mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space.", "We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID while being competitive on multilingual ASR, when compared against speech-only pre-training.", "Our speech translation model demonstrates zero-shot text translation without seeing any text translation data, providing evidence for cross-modal alignment of representations.", "mSLAM also benefits from multi-modal fine-tuning, further improving the quality of speech translation by directly leveraging text translation data during the fine-tuning process.", "Our empirical analysis highlights several opportunities and challenges arising from large-scale multimodal pre-training, suggesting directions for future research."]} {"id": "http://arxiv.org/abs/2202.01405", "title": "Joint Speech Recognition and Audio Captioning.", "authors": "Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, Shinji Watanabe", "abstract": "Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources. Most end-to-end monaural speech recognition systems either remove these background sounds using speech enhancement or train noise-robust models. For better model interpretability and holistic understanding, we aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR). The goal of AAC is to generate natural language descriptions of contents in audio samples. We propose several approaches for end-to-end joint modeling of ASR and AAC tasks and demonstrate their advantages over traditional approaches, which model these tasks independently. A major hurdle in evaluating our proposed approach is the lack of labeled audio datasets with both speech transcriptions and audio captions. Therefore we also create a multi-task dataset by mixing the clean speech Wall Street Journal corpus with multiple levels of background noises chosen from the AudioCaps dataset. 
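One ingredient of the mSLAM recipe above is a CTC loss on paired speech and transcript data. A minimal PyTorch illustration of that loss in isolation; the tensor shapes and vocabulary size are chosen arbitrarily for the sketch.

import torch
import torch.nn as nn

# CTC aligns frame-level speech encodings with a character transcript without
# requiring frame-level labels; blank id 0 is the conventional choice.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
log_probs = torch.randn(200, 4, 50).log_softmax(-1)   # (time, batch, vocab)
targets = torch.randint(1, 50, (4, 30))               # character ids (0 reserved for blank)
input_lengths = torch.full((4,), 200)
target_lengths = torch.full((4,), 30)
loss = ctc(log_probs, targets, input_lengths, target_lengths)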
We also perform extensive experimental evaluation and show improvements of our proposed methods as compared to existing state-of-the-art ASR and AAC methods.", "sentences": ["Joint Speech Recognition and Audio Captioning.", "Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.", "Most end-to-end monaural speech recognition systems either remove these background sounds using speech enhancement or train noise-robust models.", "For better model interpretability and holistic understanding, we aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR).", "The goal of AAC is to generate natural language descriptions of contents in audio samples.", "We propose several approaches for end-to-end joint modeling of ASR and AAC tasks and demonstrate their advantages over traditional approaches, which model these tasks independently.", "A major hurdle in evaluating our proposed approach is the lack of labeled audio datasets with both speech transcriptions and audio captions.", "Therefore we also create a multi-task dataset by mixing the clean speech Wall Street Journal corpus with multiple levels of background noises chosen from the AudioCaps dataset.", "We also perform extensive experimental evaluation and show improvements of our proposed methods as compared to existing state-of-the-art ASR and AAC methods."]} {"id": "http://arxiv.org/abs/2202.01624", "title": "MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances.", "authors": "Tianchi Liu, Rohan Kumar Das, Kong Aik Lee, Haizhou Li", "abstract": "The time delay neural network (TDNN) represents one of the state-of-the-art neural solutions to text-independent speaker verification. However, such networks require a large number of filters to capture the speaker characteristics at any local frequency region. In addition, the performance of such systems may degrade under short utterance scenarios. To address these issues, we propose a multi-scale frequency-channel attention (MFA), where we characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN. We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing the number of parameters and computational complexity.
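The joint ASR and audio captioning record above builds its corpus by mixing clean speech with background noise at multiple levels. A small numpy sketch of mixing at a target signal-to-noise ratio; the crude length matching and the scaling derivation are our own choices, not the paper's pipeline.

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Tile or trim the noise to the speech length (crude but sufficient here).
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Choose scale so that 10*log10(p_speech / (scale**2 * p_noise)) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise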
Further, the MFA mechanism is found to be effective for speaker verification with short test utterances.", "sentences": ["MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances.", "The time delay neural network (TDNN) represents one of the state-of-the-art neural solutions to text-independent speaker verification.", "However, such networks require a large number of filters to capture the speaker characteristics at any local frequency region.", "In addition, the performance of such systems may degrade under short utterance scenarios.", "To address these issues, we propose a multi-scale frequency-channel attention (MFA), where we characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN.", "We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing the number of parameters and computational complexity.", "Further, the MFA mechanism is found to be effective for speaker verification with short test utterances."]} {"id": "http://arxiv.org/abs/2202.01708", "title": "The relationship between sentiment score and COVID-19 cases in the United States.", "authors": "Truong Luu, Rosangela Follmann", "abstract": "The coronavirus disease (COVID-19) continues to have devastating effects across the globe. No nation has been free from the uncertainty brought by this pandemic. The health, social and economic tolls associated with it are causing strong emotions and spreading fear in people of all ages, genders, and races. Since the beginning of the COVID-19 pandemic, many have expressed their feelings and opinions related to a wide range of aspects of their lives via Twitter. In this study, we consider a framework for extracting sentiment scores and opinions from COVID-19 related tweets. We connect users' sentiment with COVID-19 cases across the USA and investigate the effect of specific COVID-19 milestones on public sentiment.
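As a rough sketch of what frequency-channel attention can look like for the MFA record above, here is a squeeze-and-excitation style block in PyTorch. The paper's multi-scale, dual-path design is more elaborate; this only shows the channel re-weighting idea.

import torch
import torch.nn as nn

class FreqChannelAttention(nn.Module):
    """Re-weight frequency channels from a global (time-pooled) summary."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                    # x: (batch, channels, time)
        w = self.fc(x.mean(dim=2))           # squeeze over time -> (batch, channels)
        return x * w.unsqueeze(-1)           # excite: per-channel gating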
The results of this work may help with the development of pandemic-related legislation, serve as a guide for scientific work, as well as inform and educate the public on core issues related to the pandemic.", "sentences": ["The relationship between sentiment score and COVID-19 cases in the United States.", "The coronavirus disease (COVID-19) continues to have devastating effects across the globe.", "No nation has been free from the uncertainty brought by this pandemic.", "The health, social and economic tolls associated with it are causing strong emotions and spreading fear in people of all ages, genders, and races.", "Since the beginning of the COVID-19 pandemic, many have expressed their feelings and opinions related to a wide range of aspects of their lives via Twitter.", "In this study, we consider a framework for extracting sentiment scores and opinions from COVID-19 related tweets.", "We connect users' sentiment with COVID-19 cases across the USA and investigate the effect of specific COVID-19 milestones on public sentiment.", "The results of this work may help with the development of pandemic-related legislation, serve as a guide for scientific work, as well as inform and educate the public on core issues related to the pandemic."]} {"id": "http://arxiv.org/abs/2202.01709", "title": "Towards Coherent and Consistent Use of Entities in Narrative Generation.", "authors": "Pinelopi Papalampidi, Kris Cao, Tomas Kocisky", "abstract": "Large pre-trained language models (LMs) have demonstrated impressive capabilities in generating long, fluent text; however, there is little to no analysis on their ability to maintain entity coherence and consistency. In this work, we focus on the end task of narrative generation and systematically analyse the long-range entity coherence and consistency in generated stories. First, we propose a set of automatic metrics for measuring model performance in terms of entity usage. Given these metrics, we quantify the limitations of current LMs. Next, we propose augmenting a pre-trained LM with a dynamic entity memory in an end-to-end manner by using an auxiliary entity-related loss for guiding the reads and writes to the memory. We demonstrate that the dynamic entity memory increases entity coherence according to both automatic and human judgment and helps preserve entity-related information especially in settings with a limited context window.
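For the sentiment study above, the core quantitative step is relating a sentiment time series to case counts. A toy sketch with scipy; the data values below are invented, and the paper's actual tweet-scoring pipeline is not reproduced here.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical daily aggregates: mean tweet sentiment and new case counts.
daily_sentiment = np.array([0.21, 0.05, -0.10, -0.32, -0.25])
daily_cases = np.array([1200, 1900, 2600, 4100, 3800])

# Correlate the two series; lagged variants would probe lead/lag effects.
r, p = pearsonr(daily_sentiment, daily_cases)
print(f"Pearson r={r:.2f}, p={p:.3f}")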
Finally, we also validate that our automatic metrics are correlated with human ratings and serve as a good indicator of the quality of generated stories.", "sentences": ["Towards Coherent and Consistent Use of Entities in Narrative Generation.", "Large pre-trained language models (LMs) have demonstrated impressive capabilities in generating long, fluent text; however, there is little to no analysis on their ability to maintain entity coherence and consistency.", "In this work, we focus on the end task of narrative generation and systematically analyse the long-range entity coherence and consistency in generated stories.", "First, we propose a set of automatic metrics for measuring model performance in terms of entity usage.", "Given these metrics, we quantify the limitations of current LMs.", "Next, we propose augmenting a pre-trained LM with a dynamic entity memory in an end-to-end manner by using an auxiliary entity-related loss for guiding the reads and writes to the memory.", "We demonstrate that the dynamic entity memory increases entity coherence according to both automatic and human judgment and helps preserve entity-related information especially in settings with a limited context window.", "Finally, we also validate that our automatic metrics are correlated with human ratings and serve as a good indicator of the quality of generated stories."]} {"id": "http://arxiv.org/abs/2202.01764", "title": "JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension.", "authors": "ByungHoon So, Kyuhong Byun, Kyungwon Kang, Seongjin Cho", "abstract": "Question Answering (QA) is a task in which a machine understands a given document and a question to find an answer. Despite impressive progress in the NLP area, QA is still a challenging problem, especially for non-English languages due to the lack of annotated datasets. In this paper, we present the Japanese Question Answering Dataset, JaQuAD, which is annotated by humans. JaQuAD consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles. We finetuned a baseline model which achieves 78.92% for F1 score and 63.38% for EM on the test set. The dataset and our experiments are available at https://github.com/SkelterLabsInc/JaQuAD.", "sentences": ["JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension.", "Question Answering (QA) is a task in which a machine understands a given document and a question to find an answer.", "Despite impressive progress in the NLP area, QA is still a challenging problem, especially for non-English languages due to the lack of annotated datasets.", "In this paper, we present the Japanese Question Answering Dataset, JaQuAD, which is annotated by humans.", "JaQuAD consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles.", "We finetuned a baseline model which achieves 78.92% for F1 score and 63.38% for EM on the test set.", "The dataset and our experiments are available at https://github.com/SkelterLabsInc/JaQuAD."]} {"id": "http://arxiv.org/abs/2202.01771", "title": "Pre-Trained Language Models for Interactive Decision-Making.", "authors": "Shuang Li, Xavier Puig, Yilun Du, Clinton Wang, Ekin Akyurek, Antonio Torralba, Jacob Andreas, Igor Mordatch", "abstract": "Language model (LM) pre-training has proven useful for a wide variety of language processing tasks, but can such pre-training be leveraged for more general machine learning problems?
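The JaQuAD record above reports F1 and EM, the standard extractive-QA metrics, sketched below. Note the character-level tokens in the usage example: whitespace splitting does not apply to Japanese, so a morphological analyzer or character splitting would supply the tokens in practice.

from collections import Counter

def exact_match(pred, gold):
    return float(pred.strip() == gold.strip())

def f1(pred_tokens, gold_tokens):
    # Token-level F1 between predicted and gold answer spans.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1(list("東京都"), list("東京")))  # character tokens -> 0.8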
We investigate the effectiveness of language modeling to scaffold learning and generalization in autonomous decision-making. We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings, and translated into actions using a policy network initialized with a pre-trained transformer LM. We demonstrate that this framework enables effective combinatorial generalization across different environments, such as VirtualHome and BabyAI. In particular, for test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6% in VirtualHome. We hypothesize and investigate three possible factors underlying the effectiveness of LM-based policy initialization. We find that sequential representations (vs. fixed-dimensional feature vectors) and the LM objective (not just the transformer architecture) are both important for generalization. Surprisingly, however, the format of the policy input encoding (e.g. as a natural language string vs. an arbitrary sequential encoding) has little influence. Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing.", "sentences": ["Pre-Trained Language Models for Interactive Decision-Making.", "Language model (LM) pre-training has proven useful for a wide variety of language processing tasks, but can such pre-training be leveraged for more general machine learning problems?", "We investigate the effectiveness of language modeling to scaffold learning and generalization in autonomous decision-making.", "We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings, and translated into actions using a policy network initialized with a pre-trained transformer LM.", "We demonstrate that this framework enables effective combinatorial generalization across different environments, such as VirtualHome and BabyAI.", "In particular, for test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6% in VirtualHome.", "We hypothesize and investigate three possible factors underlying the effectiveness of LM-based policy initialization.", "We find that sequential representations (vs. fixed-dimensional feature vectors) and the LM objective (not just the transformer architecture) are both important for generalization.", "Surprisingly, however, the format of the policy input encoding (e.g. as a natural language string vs. an arbitrary sequential encoding) has little influence.", "Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing."]} {"id": "http://arxiv.org/abs/2010.06835", "title": "A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering.", "authors": "Svitlana Vakulenko, Shayne Longpre, Zhucheng Tu, Raviteja Anantha", "abstract": "The dependency between an adequate question formulation and correct answer selection is a very intriguing but still underexplored area.
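The decision-making record above feeds goal and observation embeddings into a pre-trained transformer LM. A minimal sketch using Hugging Face's GPT2Model via inputs_embeds; the 128-d observation features, projection sizes, and discrete action head are hypothetical choices, not the paper's.

import torch
from transformers import GPT2Model

lm = GPT2Model.from_pretrained("gpt2")
obs_proj = torch.nn.Linear(128, lm.config.n_embd)    # project env features into LM space
action_head = torch.nn.Linear(lm.config.n_embd, 10)  # 10 discrete actions (assumed)

obs = torch.randn(1, 6, 128)                          # a sequence of 6 observation embeddings
hidden = lm(inputs_embeds=obs_proj(obs)).last_hidden_state
action_logits = action_head(hidden[:, -1])            # decode an action from the last position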
In this paper, we show that question rewriting (QR) of the conversational context allows us to shed more light on this phenomenon and also to evaluate the robustness of different answer selection approaches. We introduce a simple framework that enables an automated analysis of the conversational question answering (QA) performance using question rewrites, and present the results of this analysis on the TREC CAsT and QuAC (CANARD) datasets. Our experiments uncover sensitivity to question formulation of the popular state-of-the-art models for reading comprehension and passage ranking. Our results demonstrate that the reading comprehension model is insensitive to question formulation, while the passage ranking changes dramatically with a little variation in the input question. The benefit of QR is that it allows us to pinpoint and group such cases automatically. We show how to use this methodology to verify whether QA models are really learning the task or just finding shortcuts in the dataset, and better understand the frequent types of error they make.", "sentences": ["A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering.", "The dependency between an adequate question formulation and correct answer selection is a very intriguing but still underexplored area.", "In this paper, we show that question rewriting (QR) of the conversational context allows us to shed more light on this phenomenon and also to evaluate the robustness of different answer selection approaches.", "We introduce a simple framework that enables an automated analysis of the conversational question answering (QA) performance using question rewrites, and present the results of this analysis on the TREC CAsT and QuAC (CANARD) datasets.", "Our experiments uncover sensitivity to question formulation of the popular state-of-the-art models for reading comprehension and passage ranking.", "Our results demonstrate that the reading comprehension model is insensitive to question formulation, while the passage ranking changes dramatically with a little variation in the input question.", "The benefit of QR is that it allows us to pinpoint and group such cases automatically.", "We show how to use this methodology to verify whether QA models are really learning the task or just finding shortcuts in the dataset, and better understand the frequent types of error they make."]} {"id": "http://arxiv.org/abs/2110.00269", "title": "A Survey of Knowledge Enhanced Pre-trained Models.", "authors": "Jian Yang, Gang Xiao, Yulong Shen, Wei Jiang, Xinyu Hu, Ying Zhang, Jinghui Peng", "abstract": "Pre-trained models learn contextualized word representations on large-scale text corpus through a self-supervised learning method, which has achieved promising performance after fine-tuning. These models, however, suffer from poor robustness and lack of interpretability. Pre-trained models with knowledge injection, which we call knowledge enhanced pre-trained models (KEPTMs), possess deep understanding and logical reasoning and introduce interpretability to some extent. In this survey, we provide a comprehensive overview of KEPTMs for natural language processing. We first introduce the progress of pre-trained models and knowledge representation learning. Then we systematically categorize existing KEPTMs from three different perspectives.
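The question-rewriting analysis above boils down to comparing a QA model's behavior on the original conversational question versus its self-contained rewrite. A schematic loop; qa_model.answer is a hypothetical interface, not a call from any specific library.

def rewrite_sensitivity(qa_model, examples):
    # Each example holds: original question, its rewrite, the context, and the gold answer.
    flips = 0
    for ex in examples:
        a_orig = qa_model.answer(ex["question"], ex["context"])
        a_rew = qa_model.answer(ex["rewrite"], ex["context"])
        # Count cases where rewriting changes whether the model is correct.
        flips += int((a_orig == ex["gold"]) != (a_rew == ex["gold"]))
    return flips / len(examples)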
Finally, we outline some potential directions of KEPTMs for future research.", "sentences": ["A Survey of Knowledge Enhanced Pre-trained Models.", "Pre-trained models learn contextualized word representations on large-scale text corpus through a self-supervised learning method, which has achieved promising performance after fine-tuning.", "These models, however, suffer from poor robustness and lack of interpretability.", "Pre-trained models with knowledge injection, which we call knowledge enhanced pre-trained models (KEPTMs), possess deep understanding and logical reasoning and introduce interpretability to some extent.", "In this survey, we provide a comprehensive overview of KEPTMs for natural language processing.", "We first introduce the progress of pre-trained models and knowledge representation learning.", "Then we systematically categorize existing KEPTMs from three different perspectives.", "Finally, we outline some potential directions of KEPTMs for future research."]} {"id": "http://arxiv.org/abs/2110.02375", "title": "Interpreting intermediate convolutional layers in unsupervised acoustic word classification.", "authors": "Ga\u0161per Begu\u0161, Alan Zhou", "abstract": "Understanding how deep convolutional neural networks classify data has been subject to extensive research. This paper proposes a technique to visualize and interpret intermediate layers of unsupervised deep convolutional networks by averaging over individual feature maps in each convolutional layer and inferring underlying distributions of words with non-linear regression techniques. A GAN-based architecture (ciwGAN arXiv:2006.02951) that includes a Generator, a Discriminator, and a classifier was trained on unlabeled sliced lexical items from TIMIT. The training process results in a deep convolutional network that learns to classify words into discrete classes only from the requirement of the Generator to output informative data. This classifier network has no access to the training data -- only to the generated data. We propose a technique to visualize individual convolutional layers in the classifier that yields highly informative time-series data for each convolutional layer and apply it to unobserved test data. Using non-linear regression, we infer underlying distributions for each word which allows us to analyze both absolute values and shapes of individual words at different convolutional layers, as well as perform hypothesis testing on their acoustic properties. 
The technique also allows us to test individual phone contrasts and how they are represented at each layer.", "sentences": ["Interpreting intermediate convolutional layers in unsupervised acoustic word classification.", "Understanding how deep convolutional neural networks classify data has been subject to extensive research.", "This paper proposes a technique to visualize and interpret intermediate layers of unsupervised deep convolutional networks by averaging over individual feature maps in each convolutional layer and inferring underlying distributions of words with non-linear regression techniques.", "A GAN-based architecture (ciwGAN arXiv:2006.02951) that includes a Generator, a Discriminator, and a classifier was trained on unlabeled sliced lexical items from TIMIT.", "The training process results in a deep convolutional network that learns to classify words into discrete classes only from the requirement of the Generator to output informative data.", "This classifier network has no access to the training data -- only to the generated data.", "We propose a technique to visualize individual convolutional layers in the classifier that yields highly informative time-series data for each convolutional layer and apply it to unobserved test data.", "Using non-linear regression, we infer underlying distributions for each word which allows us to analyze both absolute values and shapes of individual words at different convolutional layers, as well as perform hypothesis testing on their acoustic properties.", "The technique also allows us to test individual phone contrasts and how they are represented at each layer."]} {"id": "http://arxiv.org/abs/2111.01690", "title": "Recent Advances in End-to-End Automatic Speech Recognition.", "authors": "Jinyu Li", "abstract": "Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve the state-of-the-art results in most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems at the current time. There are lots of practical factors that affect the production model deployment decision. Traditional hybrid models, being optimized for production for decades, are usually good at these factors. Without providing excellent solutions to all these factors, it is hard for E2E models to be widely commercialized. 
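The visualization technique in the ciwGAN record above averages over the feature maps of each convolutional layer, yielding one time series per layer. A PyTorch sketch using forward hooks; the model and its layer types are placeholders, not the paper's architecture.

import torch

def layer_averages(model, x):
    # Collect, for each Conv1d layer, the output averaged over its feature maps.
    averages, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            averages[name] = output.mean(dim=1).detach()  # average over channels
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv1d):
            handles.append(module.register_forward_hook(make_hook(name)))
    model(x)                       # one forward pass fills the dictionary
    for h in handles:
        h.remove()
    return averages                # {layer name: (batch, time) series}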
In this paper, we will overview the recent advances in E2E models, focusing on technologies addressing those challenges from the industry's perspective.", "sentences": ["Recent Advances in End-to-End Automatic Speech Recognition.", "Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR).", "While E2E models achieve the state-of-the-art results in most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems at the current time.", "There are lots of practical factors that affect the production model deployment decision.", "Traditional hybrid models, being optimized for production for decades, are usually good at these factors.", "Without providing excellent solutions to all these factors, it is hard for E2E models to be widely commercialized.", "In this paper, we will overview the recent advances in E2E models, focusing on technologies addressing those challenges from the industry's perspective."]} {"id": "http://arxiv.org/abs/2111.15379", "title": "Text classification problems via BERT embedding method and graph convolutional neural network.", "authors": "Loc Hoang Tran, Tuan Tran, An Mai", "abstract": "This paper presents a novel way of combining the BERT embedding method and the graph convolutional neural network. This combination is employed to solve the text classification problem. Initially, we apply the BERT embedding method to the texts (in the BBC news dataset and the IMDB movie reviews dataset) in order to transform all the texts to numerical vectors. Then, the graph convolutional neural network will be applied to these numerical vectors to classify these texts into their appropriate classes/labels. Experiments show that the performance of the graph convolutional neural network model is better than the performances of the combination of the BERT embedding method with classical machine learning models.", "sentences": ["Text classification problems via BERT embedding method and graph convolutional neural network.", "This paper presents a novel way of combining the BERT embedding method and the graph convolutional neural network.", "This combination is employed to solve the text classification problem.", "Initially, we apply the BERT embedding method to the texts (in the BBC news dataset and the IMDB movie reviews dataset) in order to transform all the texts to numerical vectors.", "Then, the graph convolutional neural network will be applied to these numerical vectors to classify these texts into their appropriate classes/labels.", "Experiments show that the performance of the graph convolutional neural network model is better than the performances of the combination of the BERT embedding method with classical machine learning models."]} {"id": "http://arxiv.org/abs/2112.03638", "title": "Scaling Structured Inference with Randomization.", "authors": "Yao Fu, John P. Cunningham, Mirella Lapata", "abstract": "Deep discrete structured models have seen considerable progress recently, but traditional inference using dynamic programming (DP) typically works with a small number of states (less than hundreds), which severely limits model capacity. At the same time, across machine learning, there is a recent trend of using randomized truncation techniques to accelerate computations involving large sums.
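For the BERT-plus-GCN record above, the pipeline is: embed each text with BERT, then propagate over a text graph. A compressed sketch assuming the transformers library; the toy adjacency matrix and the two-class output layer are our inventions, and a real model would stack several propagation steps with nonlinearities between them.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

texts = ["a film review", "a news story", "another review"]
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
X = bert(**batch).last_hidden_state[:, 0]          # [CLS] embeddings, shape (3, 768)

# Toy text graph with self-loops; real graphs come from word/doc co-occurrence.
A = torch.eye(3) + torch.tensor([[0, 0, 1], [0, 0, 0], [1, 0, 0]], dtype=torch.float32)
D_inv_sqrt = torch.diag(A.sum(1).rsqrt())
A_hat = D_inv_sqrt @ A @ D_inv_sqrt                # symmetric normalization
W = torch.nn.Linear(768, 2, bias=False)            # 2 classes (assumed)
logits = A_hat @ W(X)                              # one graph-convolution step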
Here, we propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states. Our method is widely applicable to classical DP-based inference (partition, marginal, reparameterization, entropy) and different graph structures (chains, trees, and more general hypergraphs). It is also compatible with automatic differentiation: it can be integrated with neural networks seamlessly and learned with gradient-based optimizers. Our core technique approximates the sum-product by restricting and reweighting DP on a small subset of nodes, which reduces computation by orders of magnitude. We further achieve low bias and variance via Rao-Blackwellization and importance sampling. Experiments over different graphs demonstrate the accuracy and efficiency of our approach. Furthermore, when using RDP for training a structured variational autoencoder with a scaled inference network, we achieve better test likelihood than baselines and successfully prevent posterior collapse.", "sentences": ["Scaling Structured Inference with Randomization.", "Deep discrete structured models have seen considerable progress recently, but traditional inference using dynamic programming (DP) typically works with a small number of states (less than hundreds), which severely limits model capacity.", "At the same time, across machine learning, there is a recent trend of using randomized truncation techniques to accelerate computations involving large sums.", "Here, we propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.", "Our method is widely applicable to classical DP-based inference (partition, marginal, reparameterization, entropy) and different graph structures (chains, trees, and more general hypergraphs).", "It is also compatible with automatic differentiation: it can be integrated with neural networks seamlessly and learned with gradient-based optimizers.", "Our core technique approximates the sum-product by restricting and reweighting DP on a small subset of nodes, which reduces computation by orders of magnitude.", "We further achieve low bias and variance via Rao-Blackwellization and importance sampling.", "Experiments over different graphs demonstrate the accuracy and efficiency of our approach.", "Furthermore, when using RDP for training a structured variational autoencoder with a scaled inference network, we achieve better test likelihood than baselines and successfully prevent posterior collapse."]} {"id": "http://arxiv.org/abs/2202.01212", "title": "Training Semantic Descriptors for Image-Based Localization.", "authors": "Ibrahim Cinaroglu, Yalin Bastanlar", "abstract": "Vision based solutions for the localization of vehicles have become popular recently. We employ an image retrieval based visual localization approach. The database images are kept with GPS coordinates and the location of the retrieved database image serves as an approximate position of the query image. We show that localization can be performed via descriptors solely extracted from semantically segmented images. It is reliable especially when the environment is subjected to severe illumination and seasonal changes.
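The core restrict-and-reweight move of the RDP record above can be shown on a toy chain model: sample k of S states at each step of the forward recursion and rescale the truncated sum so it stays an unbiased estimate of the full sum-product. This numpy sketch is our rendition of that idea alone; the paper's estimators add Rao-Blackwellization and importance sampling.

import numpy as np

def rdp_partition(trans, emit, k, rng):
    # trans: (S, S) transition probabilities; emit: (T, S) emission scores.
    T, S = emit.shape
    alpha = emit[0].copy()
    for t in range(1, T):
        keep = rng.choice(S, size=k, replace=False)       # restrict to k sampled states
        summed = (S / k) * (alpha[keep] @ trans[keep])    # reweight the truncated sum
        alpha = summed * emit[t]
    return alpha.sum()   # stochastic estimate of the partition function

rng = np.random.default_rng(0)
S = 500
trans = rng.random((S, S))
trans /= trans.sum(axis=1, keepdims=True)
emit = rng.random((20, S))
estimate = rdp_partition(trans, emit, k=50, rng=rng)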
Our experiments reveal that the localization performance of a semantic descriptor can increase up to the level of state-of-the-art RGB image based methods.", "sentences": ["Training Semantic Descriptors for Image-Based Localization.", "Vision based solutions for the localization of vehicles have become popular recently.", "We employ an image retrieval based visual localization approach.", "The database images are kept with GPS coordinates and the location of the retrieved database image serves as an approximate position of the query image.", "We show that localization can be performed via descriptors solely extracted from semantically segmented images.", "It is reliable especially when the environment is subjected to severe illumination and seasonal changes.", "Our experiments reveal that the localization performance of a semantic descriptor can increase up to the level of state-of-the-art RGB image based methods."]} {"id": "http://arxiv.org/abs/2202.01265", "title": "Automated processing of X-ray computed tomography images via panoptic segmentation for modeling woven composite textiles.", "authors": "Aaron Allred, Lauren J. Abbott, Alireza Doostan, Kurt Maute", "abstract": "A new, machine learning-based approach for automatically generating 3D digital geometries of woven composite textiles is proposed to overcome the limitations of existing analytical descriptions and segmentation methods. In this approach, panoptic segmentation is leveraged to produce instance segmented semantic masks from X-ray computed tomography (CT) images. This effort represents the first deep learning based automated process for segmenting unique yarn instances in a woven composite textile. Furthermore, it improves on existing methods by providing instance-level segmentation on low contrast CT datasets. Frame-to-frame instance tracking is accomplished via an intersection-over-union (IoU) approach adopted from video panoptic segmentation for assembling a 3D geometric model. A corrective recognition algorithm is developed to improve the recognition quality (RQ). The panoptic quality (PQ) metric is adopted to provide a new universal evaluation metric for reconstructed woven composite textiles. It is found that the panoptic segmentation network generalizes well to new CT images that are similar to the training set but does not extrapolate well to CT images of differing geometry, texture, and contrast. 
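Retrieval-based localization, as in the semantic-descriptor record above, reduces to nearest-neighbor search over descriptors with GPS tags. A minimal numpy sketch using cosine similarity; descriptor extraction itself is outside this snippet.

import numpy as np

def localize(query, db_descriptors, db_gps):
    # Normalize, then the dot product is cosine similarity.
    q = query / np.linalg.norm(query)
    db = db_descriptors / np.linalg.norm(db_descriptors, axis=1, keepdims=True)
    best = int(np.argmax(db @ q))
    return db_gps[best]    # GPS tag of the best-matching database image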
The utility of this approach is demonstrated by capturing yarn flow directions, contact regions between individual yarns, and the spatially varying cross-sectional areas of the yarns.", "sentences": ["Automated processing of X-ray computed tomography images via panoptic segmentation for modeling woven composite textiles.", "A new, machine learning-based approach for automatically generating 3D digital geometries of woven composite textiles is proposed to overcome the limitations of existing analytical descriptions and segmentation methods.", "In this approach, panoptic segmentation is leveraged to produce instance segmented semantic masks from X-ray computed tomography (CT) images.", "This effort represents the first deep learning based automated process for segmenting unique yarn instances in a woven composite textile.", "Furthermore, it improves on existing methods by providing instance-level segmentation on low contrast CT datasets.", "Frame-to-frame instance tracking is accomplished via an intersection-over-union (IoU) approach adopted from video panoptic segmentation for assembling a 3D geometric model.", "A corrective recognition algorithm is developed to improve the recognition quality (RQ).", "The panoptic quality (PQ) metric is adopted to provide a new universal evaluation metric for reconstructed woven composite textiles.", "It is found that the panoptic segmentation network generalizes well to new CT images that are similar to the training set but does not extrapolate well to CT images of differing geometry, texture, and contrast.", "The utility of this approach is demonstrated by capturing yarn flow directions, contact regions between individual yarns, and the spatially varying cross-sectional areas of the yarns."]} {"id": "http://arxiv.org/abs/2202.01273", "title": "Beyond Images: Label Noise Transition Matrix Estimation for Tasks with Lower-Quality Features.", "authors": "Zhaowei Zhu, Jialu Wang, Yang Liu", "abstract": "The label noise transition matrix, denoting the transition probabilities from clean labels to noisy labels, is crucial knowledge for designing statistically robust solutions. Existing estimators for noise transition matrices, e.g., using either anchor points or clusterability, focus on computer vision tasks, for which high-quality representations are relatively easy to obtain. However, for other tasks with lower-quality features, the uninformative variables may obscure the useful counterpart and make anchor-point or clusterability conditions hard to satisfy. We empirically observe the failures of these approaches on a number of commonly used datasets. In this paper, to handle this issue, we propose a generally practical information-theoretic approach to down-weight the less informative parts of the lower-quality features. The salient technical challenge is to compute the relevant information-theoretical metrics using only noisy labels instead of clean ones. We prove that the celebrated $f$-mutual information measure can often preserve the order when calculated using noisy labels. The necessity and effectiveness of the proposed method are also demonstrated by evaluating the estimation error on a varied set of tabular data and text classification tasks with lower-quality features.
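The frame-to-frame instance tracking in the textile record above is IoU matching between consecutive CT slices. A greedy numpy sketch; the 0.5 threshold is our assumption, not a value from the paper.

import numpy as np

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def link_instances(prev_masks, curr_masks, thresh=0.5):
    # Link each yarn mask in slice t to its best-overlapping mask in slice t-1.
    links = {}
    for j, cm in enumerate(curr_masks):
        ious = [mask_iou(pm, cm) for pm in prev_masks]
        i = int(np.argmax(ious)) if ious else -1
        if i >= 0 and ious[i] >= thresh:
            links[j] = i    # current instance j continues previous instance i
    return links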
Code is available at github.com/UCSC-REAL/Est-T-MI.", "sentences": ["Beyond Images: Label Noise Transition Matrix Estimation for Tasks with Lower-Quality Features.", "The label noise transition matrix, denoting the transition probabilities from clean labels to noisy labels, is crucial knowledge for designing statistically robust solutions.", "Existing estimators for noise transition matrices, e.g., using either anchor points or clusterability, focus on computer vision tasks, for which high-quality representations are relatively easy to obtain.", "However, for other tasks with lower-quality features, the uninformative variables may obscure the useful counterpart and make anchor-point or clusterability conditions hard to satisfy.", "We empirically observe the failures of these approaches on a number of commonly used datasets.", "In this paper, to handle this issue, we propose a generally practical information-theoretic approach to down-weight the less informative parts of the lower-quality features.", "The salient technical challenge is to compute the relevant information-theoretical metrics using only noisy labels instead of clean ones.", "We prove that the celebrated $f$-mutual information measure can often preserve the order when calculated using noisy labels.", "The necessity and effectiveness of the proposed method are also demonstrated by evaluating the estimation error on a varied set of tabular data and text classification tasks with lower-quality features.", "Code is available at github.com/UCSC-REAL/Est-T-MI."]} {"id": "http://arxiv.org/abs/2202.01290", "title": "Cyclical Pruning for Sparse Neural Networks.", "authors": "Suraj Srinivas, Andrey Kuzmin, Markus Nagel, Mart van Baalen, Andrii Skliar, Tijmen Blankevoort", "abstract": "Current methods for pruning neural network weights iteratively apply magnitude-based pruning on the model weights and re-train the resulting model to recover lost accuracy. In this work, we show that such strategies do not allow for the recovery of erroneously pruned weights. To enable weight recovery, we propose a simple strategy called \\textit{cyclical pruning} which requires the pruning schedule to be periodic and allows for weights pruned erroneously in one cycle to recover in subsequent ones. Experimental results on both linear models and large-scale deep neural networks show that cyclical pruning outperforms existing pruning algorithms, especially at high sparsity ratios.
Our approach is easy to tune and can be readily incorporated into existing pruning pipelines to boost performance.", "sentences": ["Cyclical Pruning for Sparse Neural Networks.", "Current methods for pruning neural network weights iteratively apply magnitude-based pruning on the model weights and re-train the resulting model to recover lost accuracy.", "In this work, we show that such strategies do not allow for the recovery of erroneously pruned weights.", "To enable weight recovery, we propose a simple strategy called \\textit{cyclical pruning} which requires the pruning schedule to be periodic and allows for weights pruned erroneously in one cycle to recover in subsequent ones.", "Experimental results on both linear models and large-scale deep neural networks show that cyclical pruning outperforms existing pruning algorithms, especially at high sparsity ratios.", "Our approach is easy to tune and can be readily incorporated into existing pruning pipelines to boost performance."]} {"id": "http://arxiv.org/abs/2202.01309", "title": "Multi-Resolution Factor Graph Based Stereo Correspondence Algorithm.", "authors": "Hanieh Shabanian, Madhusudhanan Balasubramanian", "abstract": "A dense depth-map of a scene at an arbitrary view orientation can be estimated from dense view correspondences among multiple lower-dimensional views of the scene. These low-dimensional view correspondences are dependent on the geometrical relationship among the views and the scene. Determining dense view correspondences is difficult in part due to presence of homogeneous regions in the scene and due to presence of occluded regions and illumination differences among the views. We present a new multi-resolution factor graph-based stereo matching algorithm (MR-FGS) that utilizes both intra- and inter-resolution dependencies among the views as well as among the disparity estimates. The proposed framework allows exchange of information among multiple resolutions of the correspondence problem and is useful for handling larger homogeneous regions in a scene. The MR-FGS algorithm was evaluated qualitatively and quantitatively using stereo pairs in the Middlebury stereo benchmark dataset based on commonly used performance measures. When compared to a recently developed factor graph model (FGS), the MR-FGS algorithm provided more accurate disparity estimates without requiring the commonly used post-processing procedure known as the left-right consistency check. 
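A schematic of cyclical pruning as described above: the sparsity target follows a periodic schedule, and the magnitude mask is recomputed each step, so weights zeroed in one cycle can regrow through gradient updates once the target resets. The particular ramp schedule below is one simple choice, not necessarily the paper's.

import torch

def sparsity_at(step, period, max_sparsity):
    # Sparsity ramps from 0 to max_sparsity within each cycle, then resets.
    phase = (step % period) / period
    return max_sparsity * phase

def apply_magnitude_mask(weight, sparsity):
    # Zero out the k smallest-magnitude weights for the current sparsity target.
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

# Inside a training loop, between optimizer steps (sketch):
#   w.data.mul_(apply_magnitude_mask(w.data, sparsity_at(step, period=1000, max_sparsity=0.9)))
# Gradient updates between maskings are what let erroneously pruned weights recover.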
The multi-resolution dependency constraint within the factor-graph model significantly improved contrast along depth boundaries in the MR-FGS generated disparity maps.", "sentences": ["Multi-Resolution Factor Graph Based Stereo Correspondence Algorithm.", "A dense depth-map of a scene at an arbitrary view orientation can be estimated from dense view correspondences among multiple lower-dimensional views of the scene.", "These low-dimensional view correspondences are dependent on the geometrical relationship among the views and the scene.", "Determining dense view correspondences is difficult in part due to presence of homogeneous regions in the scene and due to presence of occluded regions and illumination differences among the views.", "We present a new multi-resolution factor graph-based stereo matching algorithm (MR-FGS) that utilizes both intra- and inter-resolution dependencies among the views as well as among the disparity estimates.", "The proposed framework allows exchange of information among multiple resolutions of the correspondence problem and is useful for handling larger homogeneous regions in a scene.", "The MR-FGS algorithm was evaluated qualitatively and quantitatively using stereo pairs in the Middlebury stereo benchmark dataset based on commonly used performance measures.", "When compared to a recently developed factor graph model (FGS), the MR-FGS algorithm provided more accurate disparity estimates without requiring the commonly used post-processing procedure known as the left-right consistency check.", "The multi-resolution dependency constraint within the factor-graph model significantly improved contrast along depth boundaries in the MR-FGS generated disparity maps."]} {"id": "http://arxiv.org/abs/2202.01323", "title": "PanoDepth: A Two-Stage Approach for Monocular Omnidirectional Depth Estimation.", "authors": "Yuyan Li, Zhixin Yan, Ye Duan, Liu Ren", "abstract": "Omnidirectional 3D information is essential for a wide range of applications such as Virtual Reality, Autonomous Driving, Robotics, etc. In this paper, we propose a novel, model-agnostic, two-stage pipeline for omnidirectional monocular depth estimation. Our proposed framework PanoDepth takes one 360 image as input, produces one or more synthesized views in the first stage, and feeds the original image and the synthesized images into the subsequent stereo matching stage. In the second stage, we propose a differentiable Spherical Warping Layer to handle omnidirectional stereo geometry efficiently and effectively. By utilizing the explicit stereo-based geometric constraints in the stereo matching stage, PanoDepth can generate dense high-quality depth. We conducted extensive experiments and ablation studies to evaluate PanoDepth with both the full pipeline as well as the individual modules in each stage. 
Our results show that PanoDepth outperforms the state-of-the-art approaches by a large margin for 360 monocular depth estimation.", "sentences": ["PanoDepth: A Two-Stage Approach for Monocular Omnidirectional Depth Estimation.", "Omnidirectional 3D information is essential for a wide range of applications such as Virtual Reality, Autonomous Driving, Robotics, etc.", "In this paper, we propose a novel, model-agnostic, two-stage pipeline for omnidirectional monocular depth estimation.", "Our proposed framework PanoDepth takes one 360 image as input, produces one or more synthesized views in the first stage, and feeds the original image and the synthesized images into the subsequent stereo matching stage.", "In the second stage, we propose a differentiable Spherical Warping Layer to handle omnidirectional stereo geometry efficiently and effectively.", "By utilizing the explicit stereo-based geometric constraints in the stereo matching stage, PanoDepth can generate dense high-quality depth.", "We conducted extensive experiments and ablation studies to evaluate PanoDepth with both the full pipeline as well as the individual modules in each stage.", "Our results show that PanoDepth outperforms the state-of-the-art approaches by a large margin for 360 monocular depth estimation."]} {"id": "http://arxiv.org/abs/2202.01337", "title": "Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls.", "authors": "Farhad Maleki, Katie Ovens, Rajiv Gupta, Caroline Reinhold, Alan Spatz, Reza Forghani", "abstract": "Despite the great potential of machine learning, the lack of generalizability has hindered the widespread adoption of these technologies in routine clinical practice. We investigate three methodological pitfalls: (1) violation of independence assumption, (2) model evaluation with an inappropriate performance indicator, and (3) batch effect and how these pitfalls could affect the generalizability of machine learning models. We implement random forest and deep convolutional neural network models using several medical imaging datasets, including head and neck CT, lung CT, chest X-Ray, and histopathological images, to quantify and illustrate the effect of these pitfalls. We develop these models with and without the pitfall and compare the performance of the resulting models in terms of accuracy, precision, recall, and F1 score. Our results showed that violation of the independence assumption could substantially affect model generalizability. More specifically, (I) applying oversampling before splitting data into train, validation and test sets; (II) performing data augmentation before splitting data; (III) distributing data points for a subject across training, validation, and test sets; and (IV) applying feature selection before splitting data led to superficial boosts in model performance. We also observed that inappropriate performance indicators could lead to erroneous conclusions. Also, batch effect could lead to developing models that lack generalizability. The aforementioned methodological pitfalls lead to machine learning models with over-optimistic performance. These errors, if made, cannot be captured using internal model evaluation, and the inaccurate predictions made by the model may lead to wrong conclusions and interpretations. 
Therefore, avoiding these pitfalls is a necessary condition for developing generalizable models.", "sentences": ["Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls.", "Despite the great potential of machine learning, the lack of generalizability has hindered the widespread adoption of these technologies in routine clinical practice.", "We investigate three methodological pitfalls: (1) violation of independence assumption, (2) model evaluation with an inappropriate performance indicator, and (3) batch effect and how these pitfalls could affect the generalizability of machine learning models.", "We implement random forest and deep convolutional neural network models using several medical imaging datasets, including head and neck CT, lung CT, chest X-Ray, and histopathological images, to quantify and illustrate the effect of these pitfalls.", "We develop these models with and without the pitfall and compare the performance of the resulting models in terms of accuracy, precision, recall, and F1 score.", "Our results showed that violation of the independence assumption could substantially affect model generalizability.", "More specifically, (I) applying oversampling before splitting data into train, validation and test sets; (II) performing data augmentation before splitting data; (III) distributing data points for a subject across training, validation, and test sets; and (IV) applying feature selection before splitting data led to superficial boosts in model performance.", "We also observed that inappropriate performance indicators could lead to erroneous conclusions.", "Also, batch effect could lead to developing models that lack generalizability.", "The aforementioned methodological pitfalls lead to machine learning models with over-optimistic performance.", "These errors, if made, cannot be captured using internal model evaluation, and the inaccurate predictions made by the model may lead to wrong conclusions and interpretations.", "Therefore, avoiding these pitfalls is a necessary condition for developing generalizable models."]} {"id": "http://arxiv.org/abs/2202.01390", "title": "Exploring Sub-skeleton Trajectories for Interpretable Recognition of Sign Language.", "authors": "Joachim Gudmundsson, Martin P. Seybold, John Pfeifer", "abstract": "Recent advances in tracking sensors and pose estimation software enable smart systems to use trajectories of skeleton joint locations for supervised learning. We study the problem of accurately recognizing sign language words, which is key to narrowing the communication gap between hard and non-hard of hearing people. Our method explores a geometric feature space that we call `sub-skeleton' aspects of movement. We assess similarity of feature space trajectories using natural, speed invariant distance measures, which enables clear and insightful nearest neighbor classification. The simplicity and scalability of our basic method allows for immediate application in different data domains with little to no parameter tuning. We demonstrate the effectiveness of our basic method, and a boosted variation, with experiments on data from different application domains and tracking technologies. 
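Pitfall (I) from the generalizability record above is directly expressible in code: the split must precede oversampling, or copies of the same minority samples leak into both train and test sets. A sketch assuming scikit-learn and imbalanced-learn, with invented toy data.

import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

X = np.random.randn(100, 5)
y = np.array([0] * 90 + [1] * 10)            # imbalanced toy labels

# Correct order: split first, then oversample only the training portion.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
X_tr_bal, y_tr_bal = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)

# The pitfall: calling fit_resample(X, y) *before* the split duplicates minority
# samples across train and test, producing superficially boosted test scores.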
Surprisingly, our simple methods improve sign recognition over recent, state-of-the-art approaches.", "sentences": ["Exploring Sub-skeleton Trajectories for Interpretable Recognition of Sign Language.", "Recent advances in tracking sensors and pose estimation software enable smart systems to use trajectories of skeleton joint locations for supervised learning.", "We study the problem of accurately recognizing sign language words, which is key to narrowing the communication gap between hard and non-hard of hearing people.", "Our method explores a geometric feature space that we call `sub-skeleton' aspects of movement.", "We assess similarity of feature space trajectories using natural, speed invariant distance measures, which enables clear and insightful nearest neighbor classification.", "The simplicity and scalability of our basic method allows for immediate application in different data domains with little to no parameter tuning.", "We demonstrate the effectiveness of our basic method, and a boosted variation, with experiments on data from different application domains and tracking technologies.", "Surprisingly, our simple methods improve sign recognition over recent, state-of-the-art approaches."]} {"id": "http://arxiv.org/abs/2202.01414", "title": "DocBed: A Multi-Stage OCR Solution for Documents with Complex Layouts.", "authors": "Wenzhen Zhu, Negin Sokhandan, Guang Yang, Sujitha Martin, Suchitra Sathyanarayana", "abstract": "Digitization of newspapers is of interest for many reasons including preservation of history, accessibility and search ability, etc. While digitization of documents such as scientific articles and magazines is prevalent in literature, one of the main challenges for digitization of newspaper lies in its complex layout (e.g. articles spanning multiple columns, text interrupted by images) analysis, which is necessary to preserve human read-order. This work provides a major breakthrough in the digitization of newspapers on three fronts: first, releasing a dataset of 3000 fully-annotated, real-world newspaper images from 21 different U.S. states representing an extensive variety of complex layouts for document layout analysis; second, proposing layout segmentation as a precursor to existing optical character recognition (OCR) engines, where multiple state-of-the-art image segmentation models and several post-processing methods are explored for document layout segmentation; third, providing a thorough and structured evaluation protocol for isolated layout segmentation and end-to-end OCR.", "sentences": ["DocBed: A Multi-Stage OCR Solution for Documents with Complex Layouts.", "Digitization of newspapers is of interest for many reasons including preservation of history, accessibility and search ability, etc.", "While digitization of documents such as scientific articles and magazines is prevalent in literature, one of the main challenges for digitization of newspaper lies in its complex layout (e.g.", "articles spanning multiple columns, text interrupted by images) analysis, which is necessary to preserve human read-order.", "This work provides a major breakthrough in the digitization of newspapers on three fronts: first, releasing a dataset of 3000 fully-annotated, real-world newspaper images from 21 different U.S. 
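The DocBed abstract above motivates layout analysis by the need to preserve human read-order across segmented blocks. A toy heuristic makes that concrete; this is an illustrative sketch only (not DocBed's actual post-processing), and the col_width threshold and column-major ordering are assumptions:

```python
# Toy read-order heuristic for segmented newspaper blocks (hypothetical):
# bucket boxes into columns by x-center, then read each column top to
# bottom, and columns left to right.
def read_order(boxes, col_width=300):
    """boxes: list of (x0, y0, x1, y1) pixel rectangles."""
    def col_index(b):
        return int(((b[0] + b[2]) / 2) // col_width)
    return sorted(boxes, key=lambda b: (col_index(b), b[1]))

blocks = [(10, 400, 280, 700), (10, 40, 280, 390), (320, 40, 590, 500)]
for b in read_order(blocks):
    print(b)   # column 0 top, column 0 bottom, then column 1
```

Real newspaper layouts defeat such fixed-width rules (which is the paper's point), but the sketch shows what "read-order" means operationally once layout segmentation has produced blocks.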
states representing an extensive variety of complex layouts for document layout analysis; second, proposing layout segmentation as a precursor to existing optical character recognition (OCR) engines, where multiple state-of-the-art image segmentation models and several post-processing methods are explored for document layout segmentation; third, providing a thorough and structured evaluation protocol for isolated layout segmentation and end-to-end OCR."]} {"id": "http://arxiv.org/abs/2202.01421", "title": "Characterization of Semantic Segmentation Models on Mobile Platforms for Self-Navigation in Disaster-Struck Zones.", "authors": "Ryan Zelek, Hyeran Jeon", "abstract": "The role of unmanned vehicles in searching for and localizing victims in disaster-impacted areas, such as earthquake-struck zones, is becoming increasingly important. Self-navigation in an earthquake zone poses the unique challenge of detecting irregularly shaped obstacles such as road cracks, debris on the streets, and water puddles. In this paper, we characterize a number of state-of-the-art FCN models on mobile embedded platforms for self-navigation at these sites containing extremely irregular obstacles. We evaluate the models in terms of accuracy, performance, and energy efficiency. We present a few optimizations for our designed vision system. Lastly, we discuss the trade-offs of these models for a couple of mobile platforms that can each perform self-navigation. To enable vehicles to safely navigate earthquake-struck zones, we compiled a new annotated image database of various earthquake-impacted regions that differs from traditional road damage databases. We train a number of state-of-the-art semantic segmentation models on our database in order to identify obstacles unique to earthquake-struck zones. Based on the statistics and tradeoffs, an optimal CNN model is selected for the mobile vehicular platforms, which we apply to both low-power and extremely low-power configurations of our design. To the best of our knowledge, this is the first study that identifies the unique challenges and discusses the accuracy, performance, and energy impact of edge-based self-navigating mobile vehicles for earthquake-struck zones. 
Our proposed database and trained models are publicly available.", "sentences": ["Characterization of Semantic Segmentation Models on Mobile Platforms for Self-Navigation in Disaster-Struck Zones.", "The role of unmanned vehicles in searching for and localizing victims in disaster-impacted areas, such as earthquake-struck zones, is becoming increasingly important.", "Self-navigation in an earthquake zone poses the unique challenge of detecting irregularly shaped obstacles such as road cracks, debris on the streets, and water puddles.", "In this paper, we characterize a number of state-of-the-art FCN models on mobile embedded platforms for self-navigation at these sites containing extremely irregular obstacles.", "We evaluate the models in terms of accuracy, performance, and energy efficiency.", "We present a few optimizations for our designed vision system.", "Lastly, we discuss the trade-offs of these models for a couple of mobile platforms that can each perform self-navigation.", "To enable vehicles to safely navigate earthquake-struck zones, we compiled a new annotated image database of various earthquake-impacted regions that differs from traditional road damage databases.", "We train a number of state-of-the-art semantic segmentation models on our database in order to identify obstacles unique to earthquake-struck zones.", "Based on the statistics and tradeoffs, an optimal CNN model is selected for the mobile vehicular platforms, which we apply to both low-power and extremely low-power configurations of our design.", "To the best of our knowledge, this is the first study that identifies the unique challenges and discusses the accuracy, performance, and energy impact of edge-based self-navigating mobile vehicles for earthquake-struck zones.", "Our proposed database and trained models are publicly available."]} {"id": "http://arxiv.org/abs/2202.01440", "title": "Optimized Potential Initialization for Low-latency Spiking Neural Networks.", "authors": "Tong Bu, Jianhao Ding, Zhaofei Yu, Tiejun Huang", "abstract": "Spiking Neural Networks (SNNs) have attracted great attention due to their distinctive properties of low power consumption, biological plausibility, and adversarial robustness. The most effective way to train deep SNNs is through ANN-to-SNN conversion, which has yielded the best performance with deep network structures and large-scale datasets. However, there is a trade-off between accuracy and latency. In order to achieve precision comparable to the original ANNs, a long simulation time is needed to match the firing rate of a spiking neuron with the activation value of an analog neuron, which impedes the practical application of SNNs. In this paper, we aim to achieve high-performance converted SNNs with extremely low latency (fewer than 32 time-steps). We start by theoretically analyzing ANN-to-SNN conversion and show that scaling the thresholds plays a similar role to weight normalization. Instead of introducing constraints that facilitate ANN-to-SNN conversion at the cost of model capacity, we apply a more direct approach, optimizing the initial membrane potential to reduce the conversion loss in each layer. In addition, we demonstrate that optimal initialization of membrane potentials can achieve expected error-free ANN-to-SNN conversion. We evaluate our algorithm on the CIFAR-10, CIFAR-100 and ImageNet datasets and achieve state-of-the-art accuracy, using fewer time-steps. For example, we reach top-1 accuracy of 93.38\\% on CIFAR-10 with 16 time-steps. 
Moreover, our method can be applied to other ANN-SNN conversion methodologies and markedly improves performance when the number of time-steps is small.", "sentences": ["Optimized Potential Initialization for Low-latency Spiking Neural Networks.", "Spiking Neural Networks (SNNs) have attracted great attention due to their distinctive properties of low power consumption, biological plausibility, and adversarial robustness.", "The most effective way to train deep SNNs is through ANN-to-SNN conversion, which has yielded the best performance with deep network structures and large-scale datasets.", "However, there is a trade-off between accuracy and latency.", "In order to achieve precision comparable to the original ANNs, a long simulation time is needed to match the firing rate of a spiking neuron with the activation value of an analog neuron, which impedes the practical application of SNNs.", "In this paper, we aim to achieve high-performance converted SNNs with extremely low latency (fewer than 32 time-steps).", "We start by theoretically analyzing ANN-to-SNN conversion and show that scaling the thresholds plays a similar role to weight normalization.", "Instead of introducing constraints that facilitate ANN-to-SNN conversion at the cost of model capacity, we apply a more direct approach, optimizing the initial membrane potential to reduce the conversion loss in each layer.", "In addition, we demonstrate that optimal initialization of membrane potentials can achieve expected error-free ANN-to-SNN conversion.", "We evaluate our algorithm on the CIFAR-10, CIFAR-100 and ImageNet datasets and achieve state-of-the-art accuracy, using fewer time-steps.", "For example, we reach top-1 accuracy of 93.38\\% on CIFAR-10 with 16 time-steps.", "Moreover, our method can be applied to other ANN-SNN conversion methodologies and markedly improves performance when the number of time-steps is small."]} {"id": "http://arxiv.org/abs/2202.01459", "title": "Concept Bottleneck Model with Additional Unsupervised Concepts.", "authors": "Yoshihide Sawada, Keigo Nakamura", "abstract": "With the increasing demands for accountability, interpretability is becoming an essential capability for real-world AI applications. However, most methods utilize post-hoc approaches rather than training the interpretable model. In this article, we propose a novel interpretable model based on the concept bottleneck model (CBM). CBM uses concept labels to train an intermediate layer as the additional visible layer. However, because the number of concept labels restricts the dimension of this layer, it is difficult to obtain high accuracy with a small number of labels. To address this issue, we integrate supervised concepts with unsupervised ones trained with self-explaining neural networks (SENNs). By seamlessly training these two types of concepts while reducing the amount of computation, we can obtain both supervised and unsupervised concepts simultaneously, even for large-sized images. We refer to the proposed model as the concept bottleneck model with additional unsupervised concepts (CBM-AUC). We experimentally confirmed that the proposed model outperformed CBM and SENN. 
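The optimized-potential-initialization record above turns on a simple observation: a rate-coded integrate-and-fire (IF) neuron approximates ReLU, and where the membrane potential starts determines how the firing rate quantizes the activation. The toy model below illustrates that intuition with a constant-input IF neuron and a theta/2 start; it is a sketch of the idea only, not the paper's per-layer optimization:

```python
# Sketch of the intuition behind optimized potential initialization in
# ANN-to-SNN conversion (illustrative, not the paper's implementation):
# an IF neuron driven by a constant input approximates ReLU via its firing
# rate, and starting the membrane at theta/2 roughly centers the rate's
# quantization error around zero instead of always undershooting.
import numpy as np

def if_rate(inp, T=16, theta=1.0, v0=0.0):
    """Firing rate of an IF neuron over T steps for constant input `inp`."""
    v, spikes = v0, 0
    for _ in range(T):
        v += inp
        if v >= theta:
            v -= theta          # soft reset (reset by subtraction)
            spikes += 1
    return spikes * theta / T   # rate-coded approximation of ReLU(inp)

xs = np.linspace(0.0, 1.0, 11)
err_zero = np.mean([abs(if_rate(x) - x) for x in xs])          # v0 = 0
err_half = np.mean([abs(if_rate(x, v0=0.5) - x) for x in xs])  # v0 = theta/2
print(err_zero, err_half)   # the theta/2 start typically shrinks the error
```

With a zero start the rate always rounds the activation down, so at small T the error is systematic; a half-threshold start behaves more like round-to-nearest, which is why initialization matters most at low latency.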
We also visualized the saliency map of each concept and confirmed that it was consistent with the semantic meanings.", "sentences": ["Concept Bottleneck Model with Additional Unsupervised Concepts.", "With the increasing demands for accountability, interpretability is becoming an essential capability for real-world AI applications.", "However, most methods utilize post-hoc approaches rather than training the interpretable model.", "In this article, we propose a novel interpretable model based on the concept bottleneck model (CBM).", "CBM uses concept labels to train an intermediate layer as the additional visible layer.", "However, because the number of concept labels restricts the dimension of this layer, it is difficult to obtain high accuracy with a small number of labels.", "To address this issue, we integrate supervised concepts with unsupervised ones trained with self-explaining neural networks (SENNs).", "By seamlessly training these two types of concepts while reducing the amount of computation, we can obtain both supervised and unsupervised concepts simultaneously, even for large-sized images.", "We refer to the proposed model as the concept bottleneck model with additional unsupervised concepts (CBM-AUC).", "We experimentally confirmed that the proposed model outperformed CBM and SENN.", "We also visualized the saliency map of each concept and confirmed that it was consistent with the semantic meanings."]} {"id": "http://arxiv.org/abs/2202.01470", "title": "Boosting Monocular Depth Estimation with Sparse Guided Points.", "authors": "Guangkai Xu, Wei Yin, Hao Chen, Kai Cheng, Feng Zhao, Chunhua Shen", "abstract": "Existing monocular depth estimation methods show excellent robustness in the wild, but the affine-invariant prediction requires aligning with the ground truth globally while being converted into metric depth. In this work, we first propose a modified locally weighted linear regression strategy to leverage sparse ground truth and generate a flexible depth transformation to correct the coarse misalignment introduced by the global recovery strategy. Applying this strategy, we achieve significant improvement (up to more than 50%) over the most recent state-of-the-art methods on five zero-shot datasets. Moreover, we train a robust depth estimation model with 6.3 million data samples and analyze the training process by decoupling the inaccuracy into coarse-misalignment inaccuracy and missing-detail inaccuracy. As a result, our model based on ResNet50 even outperforms the state-of-the-art DPT ViT-Large model with the help of our recovery strategy. In addition to accuracy, consistency is also boosted for simple per-frame video depth estimation. Compared with monocular depth estimation, robust video depth estimation, and depth completion methods, our pipeline obtains state-of-the-art performance on video depth estimation without any post-processing. 
Experiments on 3D scene reconstruction from consistent video depth are also conducted for intuitive comparison.", "sentences": ["Boosting Monocular Depth Estimation with Sparse Guided Points.", "Existing monocular depth estimation methods show excellent robustness in the wild, but the affine-invariant prediction requires aligning with the ground truth globally while being converted into metric depth.", "In this work, we first propose a modified locally weighted linear regression strategy to leverage sparse ground truth and generate a flexible depth transformation to correct the coarse misalignment introduced by the global recovery strategy.", "Applying this strategy, we achieve significant improvement (up to more than 50%) over the most recent state-of-the-art methods on five zero-shot datasets.", "Moreover, we train a robust depth estimation model with 6.3 million data samples and analyze the training process by decoupling the inaccuracy into coarse-misalignment inaccuracy and missing-detail inaccuracy.", "As a result, our model based on ResNet50 even outperforms the state-of-the-art DPT ViT-Large model with the help of our recovery strategy.", "In addition to accuracy, consistency is also boosted for simple per-frame video depth estimation.", "Compared with monocular depth estimation, robust video depth estimation, and depth completion methods, our pipeline obtains state-of-the-art performance on video depth estimation without any post-processing.", "Experiments on 3D scene reconstruction from consistent video depth are also conducted for intuitive comparison."]} {"id": "http://arxiv.org/abs/2202.01478", "title": "Trajectory Forecasting from Detection with Uncertainty-Aware Motion Encoding.", "authors": "Pu Zhang, Lei Bai, Jianru Xue, Jianwu Fang, Nanning Zheng, Wanli Ouyang", "abstract": "Trajectory forecasting is critical for autonomous platforms to plan and act safely. Currently, most trajectory forecasting methods assume that object trajectories have been extracted and directly develop trajectory predictors based on the ground truth trajectories. However, this assumption does not hold in practical situations. Trajectories obtained from object detection and tracking are inevitably noisy, which could cause serious forecasting errors for predictors built on ground truth trajectories. In this paper, we propose a trajectory predictor directly based on detection results without relying on explicitly formed trajectories. Unlike traditional methods, which encode the motion cue of an agent based on its clearly defined trajectory, we extract the motion information only from the affinity cues among detection results, in which an affinity-aware state update mechanism is designed to take the uncertainty of association into account. In addition, considering that there could be multiple plausible matching candidates, we aggregate their states. This design mitigates the undesirable effect of noisy trajectories obtained from data association. Extensive ablation experiments validate the effectiveness of our method and its generalization ability on different detectors. Cross-comparison with other forecasting schemes further demonstrates the superiority of our method. 
Code will be released upon acceptance.", "sentences": ["Trajectory Forecasting from Detection with Uncertainty-Aware Motion Encoding.", "Trajectory forecasting is critical for autonomous platforms to plan and act safely.", "Currently, most trajectory forecasting methods assume that object trajectories have been extracted and directly develop trajectory predictors based on the ground truth trajectories.", "However, this assumption does not hold in practical situations.", "Trajectories obtained from object detection and tracking are inevitably noisy, which could cause serious forecasting errors for predictors built on ground truth trajectories.", "In this paper, we propose a trajectory predictor directly based on detection results without relying on explicitly formed trajectories.", "Unlike traditional methods, which encode the motion cue of an agent based on its clearly defined trajectory, we extract the motion information only from the affinity cues among detection results, in which an affinity-aware state update mechanism is designed to take the uncertainty of association into account.", "In addition, considering that there could be multiple plausible matching candidates, we aggregate their states.", "This design mitigates the undesirable effect of noisy trajectories obtained from data association.", "Extensive ablation experiments validate the effectiveness of our method and its generalization ability on different detectors.", "Cross-comparison with other forecasting schemes further demonstrates the superiority of our method.", "Code will be released upon acceptance."]} {"id": "http://arxiv.org/abs/2202.01493", "title": "Spatial Computing and Intuitive Interaction: Bringing Mixed Reality and Robotics Together.", "authors": "Jeffrey Delmerico, Roi Poranne, Federica Bogo, Helen Oleynikova, Eric Vollenweider, Stelian Coros, Juan Nieto, Marc Pollefeys", "abstract": "Spatial computing -- the ability of devices to be aware of their surroundings and to represent this digitally -- offers novel capabilities in human-robot interaction. In particular, the combination of spatial computing and egocentric sensing on mixed reality devices enables them to capture and understand human actions and translate these to actions with spatial meaning, which offers exciting new possibilities for collaboration between humans and robots. This paper presents several human-robot systems that utilize these capabilities to enable novel robot use cases: mission planning for inspection, gesture-based control, and immersive teleoperation. 
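The affinity-aware state update in the trajectory-forecasting record above can be pictured as a soft assignment: instead of committing to one matched detection, the track state absorbs an affinity-weighted mixture of all plausible candidates. A deliberately simplified numpy sketch follows; the alpha smoothing constant and the softmax weighting are assumptions of this illustration, not the authors' network:

```python
# Simplified sketch (not the paper's model) of an affinity-aware state
# update: rather than hard-assigning one detection to the track, aggregate
# all candidate detections weighted by their softmaxed affinity scores, so
# association uncertainty is carried into the motion state.
import numpy as np

def soft_update(state, detections, affinities, alpha=0.7):
    """state: (d,) vector; detections: (k, d); affinities: (k,) scores."""
    w = np.exp(affinities - affinities.max())
    w /= w.sum()                              # softmax over candidates
    observation = (w[:, None] * detections).sum(axis=0)
    return alpha * state + (1 - alpha) * observation

state = np.array([0.0, 0.0])
dets = np.array([[1.0, 0.9], [5.0, 5.2]])     # two plausible matches
aff = np.array([2.0, 0.1])                    # the first is more likely
print(soft_update(state, dets, aff))          # pulled mostly toward det 0
```

When one affinity dominates, this reduces to the usual hard association; when affinities are close, the update hedges between candidates instead of committing to a possibly wrong track.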
These works demonstrate the power of mixed reality as a tool for human-robot interaction, and the potential of spatial computing and mixed reality to drive the future of human-robot interaction.", "sentences": ["Spatial Computing and Intuitive Interaction: Bringing Mixed Reality and Robotics Together.", "Spatial computing -- the ability of devices to be aware of their surroundings and to represent this digitally -- offers novel capabilities in human-robot interaction.", "In particular, the combination of spatial computing and egocentric sensing on mixed reality devices enables them to capture and understand human actions and translate these to actions with spatial meaning, which offers exciting new possibilities for collaboration between humans and robots.", "This paper presents several human-robot systems that utilize these capabilities to enable novel robot use cases: mission planning for inspection, gesture-based control, and immersive teleoperation.", "These works demonstrate the power of mixed reality as a tool for human-robot interaction, and the potential of spatial computing and mixed reality to drive the future of human-robot interaction."]} {"id": "http://arxiv.org/abs/2202.01494", "title": "PARCEL: Physics-based unsupervised contrastive representation learning for parallel MR imaging.", "authors": "Shanshan Wang, Ruoyou Wu, Cheng Li, Juan Zou, Hairong Zheng", "abstract": "With the successful application of deep learning in magnetic resonance imaging, parallel imaging techniques based on neural networks have attracted wide attention. However, without high-quality fully sampled datasets for training, the performance of these methods tends to be limited. To address this issue, this paper proposes a physics-based unsupervised contrastive representation learning (PARCEL) method to speed up parallel MR imaging. Specifically, PARCEL has three key ingredients that enable direct deep learning from the undersampled k-space data. Namely, a parallel framework has been developed by learning two branches of model-based networks unrolled with the conjugate gradient algorithm, and augmented undersampled k-space data randomly drawn from the obtained k-space data are used to help the parallel network capture detailed information. A specially designed co-training loss guides the two networks to capture the inherent features and representations of the to-be-reconstructed MR image. 
The proposed method has been evaluated on in vivo datasets and compared to five state-of-the-art methods; the results show that PARCEL is able to learn useful representations for more accurate MR reconstructions without relying on fully sampled datasets.", "sentences": ["PARCEL: Physics-based unsupervised contrastive representation learning for parallel MR imaging.", "With the successful application of deep learning in magnetic resonance imaging, parallel imaging techniques based on neural networks have attracted wide attention.", "However, without high-quality fully sampled datasets for training, the performance of these methods tends to be limited.", "To address this issue, this paper proposes a physics-based unsupervised contrastive representation learning (PARCEL) method to speed up parallel MR imaging.", "Specifically, PARCEL has three key ingredients that enable direct deep learning from the undersampled k-space data.", "Namely, a parallel framework has been developed by learning two branches of model-based networks unrolled with the conjugate gradient algorithm, and augmented undersampled k-space data randomly drawn from the obtained k-space data are used to help the parallel network capture detailed information.", "A specially designed co-training loss guides the two networks to capture the inherent features and representations of the to-be-reconstructed MR image.", "The proposed method has been evaluated on in vivo datasets and compared to five state-of-the-art methods; the results show that PARCEL is able to learn useful representations for more accurate MR reconstructions without relying on fully sampled datasets."]} {"id": "http://arxiv.org/abs/2202.01537", "title": "Bending Graphs: Hierarchical Shape Matching using Gated Optimal Transport.", "authors": "Mahdi Saleh, Shun-Cheng Wu, Luca Cosmo, Nassir Navab, Benjamin Busam, Federico Tombari", "abstract": "Shape matching has been a long-studied problem for the computer graphics and vision community. The objective is to predict a dense correspondence between meshes that have a certain degree of deformation. Existing methods either consider the local description of sampled points or discover correspondences based on global shape information. In this work, we investigate a hierarchical learning design, to which we incorporate local patch-level information and global shape-level structures. This flexible representation enables correspondence prediction and provides rich features for the matching stage. Finally, we propose a novel optimal transport solver by recurrently updating features on non-confident nodes to learn globally consistent correspondences between the shapes. 
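The optimal transport solver in the Bending Graphs record above builds on the standard Sinkhorn scheme for turning a similarity matrix into a soft correspondence matrix. A minimal version of that core step (without the paper's gating or recurrent feature updates, which are its actual contribution) looks like this:

```python
# Minimal Sinkhorn iteration (illustrative; the paper's gated OT solver is
# more elaborate): alternately normalize rows and columns of a stabilized
# exp(similarity / tau) kernel so it approaches a doubly stochastic
# soft-correspondence matrix between two vertex sets.
import numpy as np

def sinkhorn(sim, tau=1.0, n_iter=50):
    K = np.exp((sim - sim.max()) / tau)     # stabilized kernel
    for _ in range(n_iter):
        K /= K.sum(axis=1, keepdims=True)   # row normalization
        K /= K.sum(axis=0, keepdims=True)   # column normalization
    return K

rng = np.random.default_rng(1)
feats_a = rng.normal(size=(4, 8))
feats_b = feats_a[[2, 0, 3, 1]] + 0.01 * rng.normal(size=(4, 8))  # permuted copy
P = sinkhorn(feats_a @ feats_b.T)
print(P.argmax(axis=1))   # should recover the planted permutation [1 3 0 2]
```

The doubly stochastic constraint is what enforces the global consistency the abstract emphasizes: each source vertex must distribute its mass across targets, so two source vertices cannot both grab the same target at full weight.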
Our results on publicly available datasets suggest robust performance in presence of severe deformations without the need for extensive training or refinement.", "sentences": ["Bending Graphs: Hierarchical Shape Matching using Gated Optimal Transport.", "Shape matching has been a long-studied problem for the computer graphics and vision community.", "The objective is to predict a dense correspondence between meshes that have a certain degree of deformation.", "Existing methods either consider the local description of sampled points or discover correspondences based on global shape information.", "In this work, we investigate a hierarchical learning design, to which we incorporate local patch-level information and global shape-level structures.", "This flexible representation enables correspondence prediction and provides rich features for the matching stage.", "Finally, we propose a novel optimal transport solver by recurrently updating features on non-confident nodes to learn globally consistent correspondences between the shapes.", "Our results on publicly available datasets suggest robust performance in presence of severe deformations without the need for extensive training or refinement."]} {"id": "http://arxiv.org/abs/2202.01564", "title": "Weakly Supervised Nuclei Segmentation via Instance Learning.", "authors": "Weizhen Liu, Qian He, Xuming He", "abstract": "Weakly supervised nuclei segmentation is a critical problem for pathological image analysis and greatly benefits the community due to the significant reduction of labeling cost. Adopting point annotations, previous methods mostly rely on less expressive representations for nuclei instances and thus have difficulty in handling crowded nuclei. In this paper, we propose to decouple weakly supervised semantic and instance segmentation in order to enable more effective subtask learning and to promote instance-aware representation learning. To achieve this, we design a modular deep network with two branches: a semantic proposal network and an instance encoding network, which are trained in a two-stage manner with an instance-sensitive loss. 
Empirical results show that our approach achieves the state-of-the-art performance on two public benchmarks of pathological images from different types of organs.", "sentences": ["Weakly Supervised Nuclei Segmentation via Instance Learning.", "Weakly supervised nuclei segmentation is a critical problem for pathological image analysis and greatly benefits the community due to the significant reduction of labeling cost.", "Adopting point annotations, previous methods mostly rely on less expressive representations for nuclei instances and thus have difficulty in handling crowded nuclei.", "In this paper, we propose to decouple weakly supervised semantic and instance segmentation in order to enable more effective subtask learning and to promote instance-aware representation learning.", "To achieve this, we design a modular deep network with two branches: a semantic proposal network and an instance encoding network, which are trained in a two-stage manner with an instance-sensitive loss.", "Empirical results show that our approach achieves the state-of-the-art performance on two public benchmarks of pathological images from different types of organs."]} {"id": "http://arxiv.org/abs/2202.01719", "title": "FORML: Learning to Reweight Data for Fairness.", "authors": "Bobby Yan, Skyler Seto, Nicholas Apostoloff", "abstract": "Deployed machine learning models are evaluated by multiple metrics beyond accuracy, such as fairness and robustness. However, such models are typically trained to minimize the average loss for a single metric, which is typically a proxy for accuracy. Training to optimize a single metric leaves these models prone to fairness violations, especially when the population of sub-groups in the training data are imbalanced. This work addresses the challenge of jointly optimizing fairness and predictive performance in the multi-class classification setting by introducing Fairness Optimized Reweighting via Meta-Learning (FORML), a training algorithm that balances fairness constraints and accuracy by jointly optimizing training sample weights and a neural network's parameters. The approach increases fairness by learning to weight each training datum's contribution to the loss according to its impact on reducing fairness violations, balancing the contributions from both over- and under-represented sub-groups. We empirically validate FORML on a range of benchmark and real-world classification datasets and show that our approach improves equality of opportunity fairness criteria over existing state-of-the-art reweighting methods by approximately 1% on image classification tasks and by approximately 5% on a face attribute prediction task. 
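At its core, the reweighting idea in the FORML record above reduces to a weighted training loss whose per-sample weights are themselves learnable. The toy sketch below shows only that bookkeeping (softmax-parameterized weights nudged toward an under-served sub-group); the meta-learning update that decides how to nudge them is the paper's contribution and is not reproduced here:

```python
# Toy illustration (not FORML itself) of reweighting a per-sample loss with
# learnable weights: weights are a softmax over free parameters, so they
# stay positive and sum to one, and can be shifted to up-weight the
# sub-group whose errors drive the fairness violation.
import numpy as np

def weighted_loss(losses, weight_logits):
    w = np.exp(weight_logits - weight_logits.max())
    w /= w.sum()
    return (w * losses).sum(), w

losses = np.array([0.2, 0.3, 1.5, 1.7])   # last two: under-served sub-group
logits = np.zeros(4)                      # start from uniform weights
logits[2:] += 0.5                         # a meta-step up-weights that group
total, w = weighted_loss(losses, logits)
print(w, total)   # that group now contributes more to the training signal
```

Because the weights live in logit space, gradient-based meta-optimization can move them freely while the softmax keeps the effective weighting a valid distribution over the batch.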
This improvement is achieved without pre-processing data or post-processing model outputs, without learning an additional weighting function, and while maintaining accuracy on the original predictive metric.", "sentences": ["FORML: Learning to Reweight Data for Fairness.", "Deployed machine learning models are evaluated by multiple metrics beyond accuracy, such as fairness and robustness.", "However, such models are typically trained to minimize the average loss for a single metric, which is typically a proxy for accuracy.", "Training to optimize a single metric leaves these models prone to fairness violations, especially when the population of sub-groups in the training data are imbalanced.", "This work addresses the challenge of jointly optimizing fairness and predictive performance in the multi-class classification setting by introducing Fairness Optimized Reweighting via Meta-Learning (FORML), a training algorithm that balances fairness constraints and accuracy by jointly optimizing training sample weights and a neural network's parameters.", "The approach increases fairness by learning to weight each training datum's contribution to the loss according to its impact on reducing fairness violations, balancing the contributions from both over- and under-represented sub-groups.", "We empirically validate FORML on a range of benchmark and real-world classification datasets and show that our approach improves equality of opportunity fairness criteria over existing state-of-the-art reweighting methods by approximately 1% on image classification tasks and by approximately 5% on a face attribute prediction task.", "This improvement is achieved without pre-processing data or post-processing model outputs, without learning an additional weighting function, and while maintaining accuracy on the original predictive metric."]} {"id": "http://arxiv.org/abs/2202.01727", "title": "Skeleton-Based Action Segmentation with Multi-Stage Spatial-Temporal Graph Convolutional Neural Networks.", "authors": "Benjamin Filtjens, Bart Vanrumste, Peter Slaets", "abstract": "The ability to identify and temporally segment fine-grained actions in motion capture sequences is crucial for applications in human movement analysis. Motion capture is typically performed with optical or inertial measurement systems, which encode human movement as a time series of human joint locations and orientations or their higher-order representations. State-of-the-art action segmentation approaches use multiple stages of temporal convolutions. The main idea is to generate an initial prediction with several layers of temporal convolutions and refine these predictions over multiple stages, also with temporal convolutions. Although these approaches capture long-term temporal patterns, the initial predictions do not adequately consider the spatial hierarchy among the human joints. To address this limitation, we present multi-stage spatial-temporal graph convolutional neural networks (MS-GCN). Our framework decouples the architecture of the initial prediction generation stage from the refinement stages. Specifically, we replace the initial stage of temporal convolutions with spatial-temporal graph convolutions, which better exploit the spatial configuration of the joints and their temporal dynamics. Our framework was compared to four strong baselines on five tasks. 
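The multi-stage structure in the MS-GCN record above is easy to express as a small driver loop: one initial prediction stage, then refinement stages that each consume the previous stage's softmaxed scores. The stage functions below are hypothetical linear stand-ins for the ST-GCN and temporal-convolution stages, chosen only to show the data flow:

```python
# Schematic of multi-stage action segmentation (structure only; the stage
# functions are hypothetical stand-ins): stage 1 maps features to initial
# per-frame class scores, and each refinement stage re-predicts from the
# previous stage's softmax probabilities, which is what lets later stages
# clean up over-segmentation errors.
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def run_stages(features, first_stage, refine_stages):
    outputs = [first_stage(features)]     # e.g. the ST-GCN stage in MS-GCN
    for stage in refine_stages:           # e.g. temporal-convolution stages
        outputs.append(stage(softmax(outputs[-1])))
    return outputs                        # per-stage scores, shape (T, C)

T, C, F = 100, 5, 12
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(F, C)), rng.normal(size=(C, C))
scores = run_stages(rng.normal(size=(T, F)),
                    first_stage=lambda x: x @ W1,
                    refine_stages=[lambda p: p @ W2, lambda p: p @ W2])
print([s.shape for s in scores])          # [(100, 5), (100, 5), (100, 5)]
```

Training such models typically supervises every stage's output, not just the last, so each refinement stage learns to correct, rather than merely copy, its predecessor.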
Experimental results demonstrate that our framework achieves state-of-the-art performance.", "sentences": ["Skeleton-Based Action Segmentation with Multi-Stage Spatial-Temporal Graph Convolutional Neural Networks.", "The ability to identify and temporally segment fine-grained actions in motion capture sequences is crucial for applications in human movement analysis.", "Motion capture is typically performed with optical or inertial measurement systems, which encode human movement as a time series of human joint locations and orientations or their higher-order representations.", "State-of-the-art action segmentation approaches use multiple stages of temporal convolutions.", "The main idea is to generate an initial prediction with several layers of temporal convolutions and refine these predictions over multiple stages, also with temporal convolutions.", "Although these approaches capture long-term temporal patterns, the initial predictions do not adequately consider the spatial hierarchy among the human joints.", "To address this limitation, we present multi-stage spatial-temporal graph convolutional neural networks (MS-GCN).", "Our framework decouples the architecture of the initial prediction generation stage from the refinement stages.", "Specifically, we replace the initial stage of temporal convolutions with spatial-temporal graph convolutions, which better exploit the spatial configuration of the joints and their temporal dynamics.", "Our framework was compared to four strong baselines on five tasks.", "Experimental results demonstrate that our framework achieves state-of-the-art performance."]} {"id": "http://arxiv.org/abs/2202.01731", "title": "Fast Online Video Super-Resolution with Deformable Attention Pyramid.", "authors": "Dario Fuoli, Martin Danelljan, Radu Timofte, Luc Van Gool", "abstract": "Video super-resolution (VSR) has many applications that pose strict causal, real-time, and latency constraints, including video streaming and TV. We address the VSR problem under these settings, which poses additional important challenges, since information from future frames is unavailable. Importantly, designing efficient yet effective frame alignment and fusion modules remains a central problem. In this work, we propose a recurrent VSR architecture based on a deformable attention pyramid (DAP). Our DAP aligns and integrates information from the recurrent state into the current frame prediction. To circumvent the computational cost of traditional attention-based methods, we only attend to a limited number of spatial locations, which are dynamically predicted by the DAP. Comprehensive experiments and analysis of the proposed key innovations show the effectiveness of our approach. We significantly reduce processing time in comparison to state-of-the-art methods, while maintaining high performance. 
We surpass the state-of-the-art method EDVR-M on two standard benchmarks with a speed-up of over 3x.", "sentences": ["Fast Online Video Super-Resolution with Deformable Attention Pyramid.", "Video super-resolution (VSR) has many applications that pose strict causal, real-time, and latency constraints, including video streaming and TV.", "We address the VSR problem under these settings, which poses additional important challenges, since information from future frames is unavailable.", "Importantly, designing efficient yet effective frame alignment and fusion modules remains a central problem.", "In this work, we propose a recurrent VSR architecture based on a deformable attention pyramid (DAP).", "Our DAP aligns and integrates information from the recurrent state into the current frame prediction.", "To circumvent the computational cost of traditional attention-based methods, we only attend to a limited number of spatial locations, which are dynamically predicted by the DAP.", "Comprehensive experiments and analysis of the proposed key innovations show the effectiveness of our approach.", "We significantly reduce processing time in comparison to state-of-the-art methods, while maintaining high performance.", "We surpass the state-of-the-art method EDVR-M on two standard benchmarks with a speed-up of over 3x."]} {"id": "http://arxiv.org/abs/2202.01747", "title": "The Met Dataset: Instance-level Recognition for Artworks.", "authors": "Nikolaos-Antonios Ypsilantis, Noa Garcia, Guangxing Han, Sarah Ibrahimi, Nanne Van Noord, Giorgos Tolias", "abstract": "This work introduces a dataset for large-scale instance-level recognition in the domain of artworks. The proposed benchmark exhibits a number of different challenges such as large inter-class similarity, long tail distribution, and many classes. We rely on the open access collection of The Met museum to form a large training set of about 224k classes, where each class corresponds to a museum exhibit with photos taken under studio conditions. Testing is primarily performed on photos taken by museum guests depicting exhibits, which introduces a distribution shift between training and testing. Testing is additionally performed on a set of images not related to Met exhibits, making the task resemble an out-of-distribution detection problem. The proposed benchmark follows the paradigm of other recent datasets for instance-level recognition on different domains to encourage research on domain-independent approaches. A number of suitable approaches are evaluated to offer a testbed for future comparisons. Self-supervised and supervised contrastive learning are effectively combined to train the backbone, which is used for non-parametric classification, an approach that is shown to be a promising direction. 
Dataset webpage: this http URL", "sentences": ["The Met Dataset: Instance-level Recognition for Artworks.", "This work introduces a dataset for large-scale instance-level recognition in the domain of artworks.", "The proposed benchmark exhibits a number of different challenges such as large inter-class similarity, long tail distribution, and many classes.", "We rely on the open access collection of The Met museum to form a large training set of about 224k classes, where each class corresponds to a museum exhibit with photos taken under studio conditions.", "Testing is primarily performed on photos taken by museum guests depicting exhibits, which introduces a distribution shift between training and testing.", "Testing is additionally performed on a set of images not related to Met exhibits, making the task resemble an out-of-distribution detection problem.", "The proposed benchmark follows the paradigm of other recent datasets for instance-level recognition on different domains to encourage research on domain-independent approaches.", "A number of suitable approaches are evaluated to offer a testbed for future comparisons.", "Self-supervised and supervised contrastive learning are effectively combined to train the backbone, which is used for non-parametric classification, an approach that is shown to be a promising direction.", "Dataset webpage: this http URL"]} {"id": "http://arxiv.org/abs/1912.04608", "title": "Forecasting future action sequences with attention: a new approach to weakly supervised action forecasting.", "authors": "Yan Bin Ng, Basura Fernando", "abstract": "Future human action forecasting from partial observations of activities is an important problem in many practical applications such as assistive robotics, video surveillance and security. We present a method to forecast actions for the unseen future of the video using a neural machine translation technique that uses an encoder-decoder architecture. The input to this model is the observed RGB video, and the objective is to forecast the correct future symbolic action sequence. Unlike prior methods, which make one action prediction per frame for some unseen percentage of the video, we predict the complete action sequence that is required to accomplish the activity. We coin this task action sequence forecasting. To cater for two types of uncertainty in the future predictions, we propose a novel loss function. We show that a combination of optimal transport and future-uncertainty losses helps to improve results. We extend our action sequence forecasting model to perform weakly supervised action forecasting on two challenging datasets, the Breakfast and the 50Salads. Specifically, we propose a model to predict actions of future unseen frames without using frame-level annotations during training. Using Fisher vector features, our supervised model outperforms the state-of-the-art action forecasting model by 0.83% and 7.09% on the Breakfast and the 50Salads datasets, respectively. Our weakly supervised model is only 0.6% behind the most recent state-of-the-art supervised model and obtains comparable results to other published fully supervised methods, and sometimes even outperforms them on the Breakfast dataset. 
Most interestingly, our weakly supervised model outperforms prior models by 1.04%, leveraging the proposed weakly supervised architecture and the effective use of the attention mechanism and loss functions.", "sentences": ["Forecasting future action sequences with attention: a new approach to weakly supervised action forecasting.", "Future human action forecasting from partial observations of activities is an important problem in many practical applications such as assistive robotics, video surveillance and security.", "We present a method to forecast actions for the unseen future of the video using a neural machine translation technique that uses an encoder-decoder architecture.", "The input to this model is the observed RGB video, and the objective is to forecast the correct future symbolic action sequence.", "Unlike prior methods, which make one action prediction per frame for some unseen percentage of the video, we predict the complete action sequence that is required to accomplish the activity.", "We coin this task action sequence forecasting.", "To cater for two types of uncertainty in the future predictions, we propose a novel loss function.", "We show that a combination of optimal transport and future-uncertainty losses helps to improve results.", "We extend our action sequence forecasting model to perform weakly supervised action forecasting on two challenging datasets, the Breakfast and the 50Salads.", "Specifically, we propose a model to predict actions of future unseen frames without using frame-level annotations during training.", "Using Fisher vector features, our supervised model outperforms the state-of-the-art action forecasting model by 0.83% and 7.09% on the Breakfast and the 50Salads datasets, respectively.", "Our weakly supervised model is only 0.6% behind the most recent state-of-the-art supervised model and obtains comparable results to other published fully supervised methods, and sometimes even outperforms them on the Breakfast dataset.", "Most interestingly, our weakly supervised model outperforms prior models by 1.04%, leveraging the proposed weakly supervised architecture and the effective use of the attention mechanism and loss functions."]} {"id": "http://arxiv.org/abs/2009.09205", "title": "Adversarial Rain Attack and Defensive Deraining for DNN Perception.", "authors": "Liming Zhai, Felix Juefei-Xu, Qing Guo, Xiaofei Xie, Lei Ma, Wei Feng, Shengchao Qin, Yang Liu", "abstract": "Rain often poses inevitable threats to deep neural network (DNN) based perception systems, and a comprehensive investigation of the potential risks of the rain to DNNs is of great importance. However, it is rather difficult to collect or synthesize rainy images that can represent all rain situations that would possibly occur in the real world. To this end, in this paper, we start from a new perspective and propose to combine two totally different studies, i.e., rainy image synthesis and adversarial attack. We first present an adversarial rain attack, with which we could simulate various rain situations with the guidance of deployed DNNs and reveal the potential threat factors that can be brought by rain. In particular, we design a factor-aware rain generation that synthesizes rain streaks according to the camera exposure process and models the learnable rain factors for adversarial attack. With this generator, we perform the adversarial rain attack against the image classification and object detection. 
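The rain attack in the record above optimizes rain factors rather than raw pixels. A minimal PyTorch-style loop conveys the idea; the alpha-blended streak compositing below is a hypothetical stand-in for the paper's factor-aware, exposure-based generator, and the tiny linear model exists only to make the sketch self-contained:

```python
# Minimal sketch of "attack the rain factors, not the pixels" (hypothetical
# renderer, not the paper's generator). The rain intensity alpha is the
# learnable factor, updated by signed gradient ascent on the model's loss,
# then clipped so the rain stays physically plausible.
import torch

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
image = torch.rand(1, 3, 8, 8)
label = torch.tensor([3])
streaks = torch.rand(1, 3, 8, 8)                    # fixed stand-in streak layer

alpha = torch.tensor(0.1, requires_grad=True)       # learnable rain factor
for _ in range(10):
    rainy = (1 - alpha) * image + alpha * streaks   # toy compositing
    loss = torch.nn.functional.cross_entropy(model(rainy), label)
    loss.backward()
    with torch.no_grad():
        alpha += 0.05 * alpha.grad.sign()           # ascend the loss
        alpha.clamp_(0.0, 0.6)                      # keep the rain plausible
        alpha.grad.zero_()
print(float(alpha))
```

Constraining the search to a few physically meaningful factors is what keeps the synthesized images looking like rain rather than unconstrained adversarial noise, which is also why the same images can be reused to augment deraining models.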
To defend the DNNs from the negative rain effect, we also present a defensive deraining strategy, for which we design an adversarial rain augmentation that uses mixed adversarial rain layers to enhance deraining models for downstream DNN perception. Our large-scale evaluation on various datasets demonstrates that our synthesized rainy images with realistic appearances not only exhibit strong adversarial capability against DNNs, but also boost the deraining models for defensive purposes, building the foundation for further rain-robust perception studies.", "sentences": ["Adversarial Rain Attack and Defensive Deraining for DNN Perception.", "Rain often poses inevitable threats to deep neural network (DNN) based perception systems, and a comprehensive investigation of the potential risks of the rain to DNNs is of great importance.", "However, it is rather difficult to collect or synthesize rainy images that can represent all rain situations that would possibly occur in the real world.", "To this end, in this paper, we start from a new perspective and propose to combine two totally different studies, i.e., rainy image synthesis and adversarial attack.", "We first present an adversarial rain attack, with which we could simulate various rain situations with the guidance of deployed DNNs and reveal the potential threat factors that can be brought by rain.", "In particular, we design a factor-aware rain generation that synthesizes rain streaks according to the camera exposure process and models the learnable rain factors for adversarial attack.", "With this generator, we perform the adversarial rain attack against the image classification and object detection.", "To defend the DNNs from the negative rain effect, we also present a defensive deraining strategy, for which we design an adversarial rain augmentation that uses mixed adversarial rain layers to enhance deraining models for downstream DNN perception.", "Our large-scale evaluation on various datasets demonstrates that our synthesized rainy images with realistic appearances not only exhibit strong adversarial capability against DNNs, but also boost the deraining models for defensive purposes, building the foundation for further rain-robust perception studies."]} {"id": "http://arxiv.org/abs/2102.02696", "title": "Active Boundary Loss for Semantic Segmentation.", "authors": "Chi Wang, Yunke Zhang, Miaomiao Cui, Peiran Ren, Yin Yang, Xuansong Xie, XianSheng Hua, Hujun Bao, Weiwei Xu", "abstract": "This paper proposes a novel active boundary loss for semantic segmentation. It can progressively encourage the alignment between predicted boundaries and ground-truth boundaries during end-to-end training, which is not explicitly enforced in commonly used cross-entropy loss. Based on the predicted boundaries detected from the segmentation results using current network parameters, we formulate the boundary alignment problem as a differentiable direction vector prediction problem to guide the movement of predicted boundaries in each iteration. Our loss is model-agnostic and can be plugged in to the training of segmentation networks to improve the boundary details. 
Experimental results show that training with the active boundary loss can effectively improve the boundary F-score and mean Intersection-over-Union on challenging image and video object segmentation datasets.", "sentences": ["Active Boundary Loss for Semantic Segmentation.", "This paper proposes a novel active boundary loss for semantic segmentation.", "It can progressively encourage the alignment between predicted boundaries and ground-truth boundaries during end-to-end training, which is not explicitly enforced in commonly used cross-entropy loss.", "Based on the predicted boundaries detected from the segmentation results using current network parameters, we formulate the boundary alignment problem as a differentiable direction vector prediction problem to guide the movement of predicted boundaries in each iteration.", "Our loss is model-agnostic and can be plugged in to the training of segmentation networks to improve the boundary details.", "Experimental results show that training with the active boundary loss can effectively improve the boundary F-score and mean Intersection-over-Union on challenging image and video object segmentation datasets."]} {"id": "http://arxiv.org/abs/2103.15449", "title": "Automated freezing of gait assessment with marker-based motion capture and multi-stage spatial-temporal graph convolutional neural networks.", "authors": "Benjamin Filtjens, Pieter Ginis, Alice Nieuwboer, Peter Slaets, Bart Vanrumste", "abstract": "Freezing of gait (FOG) is a common and debilitating gait impairment in Parkinson's disease. Further insight into this phenomenon is hampered by the difficulty to objectively assess FOG. To meet this clinical need, this paper proposes an automated motion-capture-based FOG assessment method driven by a novel deep neural network. Automated FOG assessment can be formulated as an action segmentation problem, where temporal models are tasked to recognize and temporally localize the FOG segments in untrimmed motion capture trials. This paper takes a closer look at the performance of state-of-the-art action segmentation models when tasked to automatically assess FOG. Furthermore, a novel deep neural network architecture is proposed that aims to better capture the spatial and temporal dependencies than the state-of-the-art baselines. The proposed network, termed multi-stage spatial-temporal graph convolutional network (MS-GCN), combines the spatial-temporal graph convolutional network (ST-GCN) and the multi-stage temporal convolutional network (MS-TCN). The ST-GCN captures the hierarchical spatial-temporal motion among the joints inherent to motion capture, while the multi-stage component reduces over-segmentation errors by refining the predictions over multiple stages. The experiments indicate that the proposed model outperforms four state-of-the-art baselines. Moreover, FOG outcomes derived from MS-GCN predictions had an excellent (r=0.93 [0.87, 0.97]) and moderately strong (r=0.75 [0.55, 0.87]) linear relationship with FOG outcomes derived from manual annotations. The proposed MS-GCN may provide an automated and objective alternative to labor-intensive clinician-based FOG assessment. 
Future work can now assess the generalization of MS-GCN to a larger and more varied verification cohort.", "sentences": ["Automated freezing of gait assessment with marker-based motion capture and multi-stage spatial-temporal graph convolutional neural networks.", "Freezing of gait (FOG) is a common and debilitating gait impairment in Parkinson's disease.", "Further insight into this phenomenon is hampered by the difficulty to objectively assess FOG.", "To meet this clinical need, this paper proposes an automated motion-capture-based FOG assessment method driven by a novel deep neural network.", "Automated FOG assessment can be formulated as an action segmentation problem, where temporal models are tasked to recognize and temporally localize the FOG segments in untrimmed motion capture trials.", "This paper takes a closer look at the performance of state-of-the-art action segmentation models when tasked to automatically assess FOG.", "Furthermore, a novel deep neural network architecture is proposed that aims to better capture the spatial and temporal dependencies than the state-of-the-art baselines.", "The proposed network, termed multi-stage spatial-temporal graph convolutional network (MS-GCN), combines the spatial-temporal graph convolutional network (ST-GCN) and the multi-stage temporal convolutional network (MS-TCN).", "The ST-GCN captures the hierarchical spatial-temporal motion among the joints inherent to motion capture, while the multi-stage component reduces over-segmentation errors by refining the predictions over multiple stages.", "The experiments indicate that the proposed model outperforms four state-of-the-art baselines.", "Moreover, FOG outcomes derived from MS-GCN predictions had an excellent (r=0.93 [0.87, 0.97]) and moderately strong (r=0.75 [0.55, 0.87]) linear relationship with FOG outcomes derived from manual annotations.", "The proposed MS-GCN may provide an automated and objective alternative to labor-intensive clinician-based FOG assessment.", "Future work can now assess the generalization of MS-GCN to a larger and more varied verification cohort."]} {"id": "http://arxiv.org/abs/2105.02103", "title": "Prototype Memory for Large-scale Face Representation Learning.", "authors": "Evgeny Smirnov, Nikita Garaev, Vasiliy Galyuk, Evgeny Lukyanets", "abstract": "Face representation learning using datasets with a massive number of identities requires appropriate training methods. The softmax-based approach, currently the state of the art in face recognition, is not suitable in its usual \"full softmax\" form for datasets with millions of persons. Several methods, based on the \"sampled softmax\" approach, were proposed to remove this limitation. These methods, however, have a set of disadvantages. One of them is the problem of \"prototype obsolescence\": classifier weights (prototypes) of the rarely sampled classes receive scarce gradients and become outdated and detached from the current encoder state, resulting in incorrect training signals. This problem is especially serious in ultra-large-scale datasets. In this paper, we propose a novel face representation learning model called Prototype Memory, which alleviates this problem and allows training on a dataset of any size. Prototype Memory consists of a limited-size memory module for storing recent class prototypes and employs a set of algorithms to update it in an appropriate way. New class prototypes are generated on the fly using exemplar embeddings in the current mini-batch. 
These prototypes are enqueued into the memory and used in the role of classifier weights for softmax classification-based training. To prevent obsolescence and keep the memory in close connection with the encoder, prototypes are regularly refreshed, and the oldest ones are dequeued and disposed of. Prototype Memory is computationally efficient and independent of dataset size. It can be used with various loss functions, hard example mining algorithms and encoder architectures. We demonstrate the effectiveness of the proposed model through extensive experiments on popular face recognition benchmarks.", "sentences": ["Prototype Memory for Large-scale Face Representation Learning.", "Face representation learning using datasets with a massive number of identities requires appropriate training methods.", "The softmax-based approach, currently the state of the art in face recognition, is not suitable in its usual \"full softmax\" form for datasets with millions of persons.", "Several methods, based on the \"sampled softmax\" approach, were proposed to remove this limitation.", "These methods, however, have a set of disadvantages.", "One of them is the problem of \"prototype obsolescence\": classifier weights (prototypes) of the rarely sampled classes receive scarce gradients and become outdated and detached from the current encoder state, resulting in incorrect training signals.", "This problem is especially serious in ultra-large-scale datasets.", "In this paper, we propose a novel face representation learning model called Prototype Memory, which alleviates this problem and allows training on a dataset of any size.", "Prototype Memory consists of a limited-size memory module for storing recent class prototypes and employs a set of algorithms to update it in an appropriate way.", "New class prototypes are generated on the fly using exemplar embeddings in the current mini-batch.", "These prototypes are enqueued into the memory and used in the role of classifier weights for softmax classification-based training.", "To prevent obsolescence and keep the memory in close connection with the encoder, prototypes are regularly refreshed, and the oldest ones are dequeued and disposed of.", "Prototype Memory is computationally efficient and independent of dataset size.", "It can be used with various loss functions, hard example mining algorithms and encoder architectures.", "We demonstrate the effectiveness of the proposed model through extensive experiments on popular face recognition benchmarks."]} {"id": "http://arxiv.org/abs/2106.10887", "title": "Trust It or Not: Confidence-Guided Automatic Radiology Report Generation.", "authors": "Yixin Wang, Zihao Lin, Zhe Xu, Haoyu Dong, Jiang Tian, Jie Luo, Zhongchao Shi, Yang Zhang, Jianping Fan, Zhiqiang He", "abstract": "Medical imaging plays a pivotal role in diagnosis and treatment in clinical practice. Inspired by the significant progress in automatic image captioning, various deep learning (DL)-based methods have been proposed to generate radiology reports for medical images. Despite promising results, previous works overlook the uncertainties of their models and are thus unable to provide clinicians with the reliability/confidence of the generated radiology reports to assist their decision-making. In this paper, we propose a novel method to explicitly quantify both the visual uncertainty and the textual uncertainty for DL-based radiology report generation. 
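The enqueue/refresh/dequeue life cycle described in the Prototype Memory record above can be captured in a few lines. This is bare-bones bookkeeping for illustration only (an OrderedDict used as a FIFO; the real model couples this memory with softmax training and hard example mining):

```python
# Bare-bones sketch of the Prototype Memory bookkeeping (illustrative):
# class prototypes live in a fixed-size FIFO structure, fresh prototypes
# computed from the current batch refresh their class entry, and the oldest
# entries are evicted so stale prototypes never linger.
from collections import OrderedDict
import numpy as np

class PrototypeMemory:
    def __init__(self, capacity):
        self.capacity, self.protos = capacity, OrderedDict()

    def update(self, class_id, embeddings):
        proto = np.mean(embeddings, axis=0)      # prototype from exemplars
        proto /= np.linalg.norm(proto) + 1e-12   # unit-normalize
        self.protos.pop(class_id, None)          # refresh: re-enqueue as newest
        self.protos[class_id] = proto
        while len(self.protos) > self.capacity:
            self.protos.popitem(last=False)      # dequeue and dispose of oldest

memory = PrototypeMemory(capacity=2)
rng = np.random.default_rng(0)
for cid in [0, 1, 2, 1]:                         # class 0 ends up evicted
    memory.update(cid, rng.normal(size=(4, 16)))
print(list(memory.protos))                       # [2, 1]
```

The fixed capacity is what makes the scheme independent of dataset size: only recently seen classes occupy memory, and the prototypes used as classifier weights are always recent relative to the encoder.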
Such multi-modal uncertainties can sufficiently capture the model confidence degree at both the report level and the sentence level, and thus they are further leveraged to weight the losses for more comprehensive model optimization. Experimental results have demonstrated that the proposed method for model uncertainty characterization and estimation can produce more reliable confidence scores for radiology report generation, and the modified loss function, which takes into account the uncertainties, leads to better model performance on two public radiology report datasets. In addition, the quality of the automatically generated reports was manually evaluated by human raters and the results also indicate that the proposed uncertainties can reflect the variance of clinical diagnosis.", "sentences": ["Trust It or Not: Confidence-Guided Automatic Radiology Report Generation.", "Medical imaging plays a pivotal role in diagnosis and treatment in clinical practice.", "Inspired by the significant progress in automatic image captioning, various deep learning (DL)-based methods have been proposed to generate radiology reports for medical images.", "Despite promising results, previous works overlook the uncertainties of their models and are thus unable to provide clinicians with the reliability/confidence of the generated radiology reports to assist their decision-making.", "In this paper, we propose a novel method to explicitly quantify both the visual uncertainty and the textual uncertainty for DL-based radiology report generation.", "Such multi-modal uncertainties can sufficiently capture the model confidence degree at both the report level and the sentence level, and thus they are further leveraged to weight the losses for more comprehensive model optimization.", "Experimental results have demonstrated that the proposed method for model uncertainty characterization and estimation can produce more reliable confidence scores for radiology report generation, and the modified loss function, which takes into account the uncertainties, leads to better model performance on two public radiology report datasets.", "In addition, the quality of the automatically generated reports was manually evaluated by human raters and the results also indicate that the proposed uncertainties can reflect the variance of clinical diagnosis."]} {"id": "http://arxiv.org/abs/2107.02434", "title": "Self-Adversarial Training incorporating Forgery Attention for Image Forgery Localization.", "authors": "Long Zhuo, Shunquan Tan, Bin Li, Jiwu Huang", "abstract": "Image editing techniques enable people to modify the content of an image without leaving visual traces and thus may cause serious security risks. Hence the detection and localization of these forgeries become quite necessary and challenging. Furthermore, unlike other tasks with extensive data, there is usually a lack of annotated forged images for training due to annotation difficulties. In this paper, we propose a self-adversarial training strategy and a reliable coarse-to-fine network that utilizes a self-attention mechanism to localize forged regions in forgery images. The self-attention module is based on a Channel-Wise High Pass Filter block (CW-HPF). CW-HPF leverages inter-channel relationships of features and extracts noise features by high pass filters. Based on the CW-HPF, a self-attention mechanism, called forgery attention, is proposed to capture rich contextual dependencies of intrinsic inconsistency extracted from tampered regions. 
Specifically, we append two types of attention modules on top of CW-HPF respectively to model internal interdependencies in spatial dimension and external dependencies among channels. We exploit a coarse-to-fine network to enhance the noise inconsistency between original and tampered regions. More importantly, to address the issue of insufficient training data, we design a self-adversarial training strategy that expands training data dynamically to achieve more robust performance. Specifically, in each training iteration, we perform adversarial attacks against our network to generate adversarial examples and train our model on them. Extensive experimental results demonstrate that our proposed algorithm steadily outperforms state-of-the-art methods by a clear margin in different benchmark datasets.", "sentences": ["Self-Adversarial Training incorporating Forgery Attention for Image Forgery Localization.", "Image editing techniques enable people to modify the content of an image without leaving visual traces and thus may cause serious security risks.", "Hence the detection and localization of these forgeries become quite necessary and challenging.", "Furthermore, unlike other tasks with extensive data, there is usually a lack of annotated forged images for training due to annotation difficulties.", "In this paper, we propose a self-adversarial training strategy and a reliable coarse-to-fine network that utilizes a self-attention mechanism to localize forged regions in forgery images.", "The self-attention module is based on a Channel-Wise High Pass Filter block (CW-HPF).", "CW-HPF leverages inter-channel relationships of features and extracts noise features by high pass filters.", "Based on the CW-HPF, a self-attention mechanism, called forgery attention, is proposed to capture rich contextual dependencies of intrinsic inconsistency extracted from tampered regions.", "Specifically, we append two types of attention modules on top of CW-HPF respectively to model internal interdependencies in spatial dimension and external dependencies among channels.", "We exploit a coarse-to-fine network to enhance the noise inconsistency between original and tampered regions.", "More importantly, to address the issue of insufficient training data, we design a self-adversarial training strategy that expands training data dynamically to achieve more robust performance.", "Specifically, in each training iteration, we perform adversarial attacks against our network to generate adversarial examples and train our model on them.", "Extensive experimental results demonstrate that our proposed algorithm steadily outperforms state-of-the-art methods by a clear margin in different benchmark datasets."]} {"id": "http://arxiv.org/abs/2107.07056", "title": "Learning Sparse Interaction Graphs of Partially Detected Pedestrians for Trajectory Prediction.", "authors": "Zhe Huang, Ruohua Li, Kazuki Shin, Katherine Driggs-Campbell", "abstract": "Multi-pedestrian trajectory prediction is an indispensable element of autonomous systems that safely interact with crowds in unstructured environments. Many recent efforts in trajectory prediction algorithms have focused on understanding social norms behind pedestrian motions. Yet we observe these works usually hold two assumptions, which prevent them from being smoothly applied to robot applications: (1) positions of all pedestrians are consistently tracked, and (2) the target agent pays attention to all pedestrians in the scene. 
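The self-adversarial strategy in the forgery-localization record above (attack the current network each iteration, then train on the resulting adversarial examples) can be sketched as follows; the FGSM attack, its strength, and the per-pixel binary loss are assumptions for illustration, not necessarily the paper's choices.

```python
# Sketch of one self-adversarial training step, assuming an FGSM attack
# and a per-pixel binary localization loss.
import torch
import torch.nn.functional as F

def self_adversarial_step(model, optimizer, x, y, eps=2 / 255):
    # 1) Expand the training data dynamically: attack the current model.
    x_adv = x.clone().detach().requires_grad_(True)
    loss_adv = F.binary_cross_entropy_with_logits(model(x_adv), y)
    grad, = torch.autograd.grad(loss_adv, x_adv)
    x_adv = (x_adv + eps * grad.sign()).clamp(0, 1).detach()

    # 2) Train on the clean batch plus its adversarial counterpart; the
    #    forged-region mask y is unchanged by the pixel perturbation.
    optimizer.zero_grad()
    total = (F.binary_cross_entropy_with_logits(model(x), y)
             + F.binary_cross_entropy_with_logits(model(x_adv), y))
    total.backward()
    optimizer.step()
    return total.item()

model = torch.nn.Conv2d(3, 1, 3, padding=1)        # stand-in localizer
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(2, 3, 32, 32)
y = (torch.rand(2, 1, 32, 32) > 0.5).float()
self_adversarial_step(model, opt, x, y)
```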
The first assumption leads to biased interaction modeling with incomplete pedestrian data. The second assumption introduces aggregation of redundant surrounding information, and the target agent may be affected by unimportant neighbors or present overly conservative motion. Thus, we propose Gumbel Social Transformer, in which an Edge Gumbel Selector samples a sparse interaction graph of partially detected pedestrians at each time step. A Node Transformer Encoder and a Masked LSTM encode pedestrian features with sampled sparse graphs to predict trajectories. We demonstrate that our model overcomes potential problems caused by the aforementioned assumptions, and our approach outperforms related works in trajectory prediction benchmarks. Code is available at \\url{https://github.com/tedhuang96/gst}.", "sentences": ["Learning Sparse Interaction Graphs of Partially Detected Pedestrians for Trajectory Prediction.", "Multi-pedestrian trajectory prediction is an indispensable element of autonomous systems that safely interact with crowds in unstructured environments.", "Many recent efforts in trajectory prediction algorithms have focused on understanding social norms behind pedestrian motions.", "Yet we observe these works usually hold two assumptions, which prevent them from being smoothly applied to robot applications: (1) positions of all pedestrians are consistently tracked, and (2) the target agent pays attention to all pedestrians in the scene.", "The first assumption leads to biased interaction modeling with incomplete pedestrian data.", "The second assumption introduces aggregation of redundant surrounding information, and the target agent may be affected by unimportant neighbors or present overly conservative motion.", "Thus, we propose Gumbel Social Transformer, in which an Edge Gumbel Selector samples a sparse interaction graph of partially detected pedestrians at each time step.", "A Node Transformer Encoder and a Masked LSTM encode pedestrian features with sampled sparse graphs to predict trajectories.", "We demonstrate that our model overcomes potential problems caused by the aforementioned assumptions, and our approach outperforms related works in trajectory prediction benchmarks.", "Code is available at \\url{https://github.com/tedhuang96/gst}."]} {"id": "http://arxiv.org/abs/2107.07699", "title": "A Comparative Study of Deep Learning Classification Methods on a Small Environmental Microorganism Image Dataset (EMDS-6): from Convolutional Neural Networks to Visual Transformers.", "authors": "Peng Zhao, Chen Li, Md Mamunur Rahaman, Hao Xu, Hechen Yang, Hongzan Sun, Tao Jiang, Marcin Grzegorzek", "abstract": "In recent years, deep learning has made brilliant achievements in Environmental Microorganism (EM) image classification. However, image classification of small EM datasets has still not obtained good research results. Therefore, researchers need to spend a lot of time searching for models with good classification performance and suitable for the current equipment working environment. To provide reliable references for researchers, we conduct a series of comparison experiments on 21 deep learning models. The experiment includes direct classification, imbalanced training, and hyperparameter tuning experiments. During the experiments, we find complementarities among the 21 models, which is the basis for feature fusion related experiments. 
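The Edge Gumbel Selector described in the trajectory-prediction record above samples a sparse interaction graph of partially detected pedestrians at each time step. A rough sketch using PyTorch's Gumbel-softmax follows; restricting each pedestrian to a single sampled neighbor per row is a simplification, and the affinity scores are placeholders for learned ones.

```python
# Sketch of sampling a sparse interaction graph with Gumbel-softmax.
# One sampled neighbor per row is a simplification of the selector.
import torch
import torch.nn.functional as F

def sample_sparse_graph(edge_scores, detected, tau=0.5, hard=True):
    """edge_scores: (N, N) affinities; detected: (N, N) bool mask of
    pairs whose pedestrians are currently detected. Returns an (N, N)
    adjacency with one (soft, differentiable) neighbor per row."""
    scores = edge_scores.masked_fill(~detected, float('-inf'))
    return F.gumbel_softmax(scores, tau=tau, hard=hard, dim=-1)

N = 5
scores = torch.randn(N, N, requires_grad=True)
detected = torch.ones(N, N, dtype=torch.bool)   # all pairs observed here
adj = sample_sparse_graph(scores, detected)
print(adj.sum(dim=-1))   # each row sums to 1: a single sampled edge
```

With `hard=True` the samples are discrete but gradients still flow to the scores via the straight-through trick, which is what makes the edge selection trainable end to end.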
We also find that geometric-deformation data augmentation struggles to improve the performance of the VT (ViT, DeiT, BotNet and T2T-ViT) series models. In terms of model performance, Xception has the best classification performance, the ViT model requires the least training time, and the ShuffleNet-V2 model has the fewest parameters.", "sentences": ["A Comparative Study of Deep Learning Classification Methods on a Small Environmental Microorganism Image Dataset (EMDS-6): from Convolutional Neural Networks to Visual Transformers.", "In recent years, deep learning has made brilliant achievements in Environmental Microorganism (EM) image classification.", "However, image classification of small EM datasets has still not obtained good research results.", "Therefore, researchers need to spend a lot of time searching for models with good classification performance and suitable for the current equipment working environment.", "To provide reliable references for researchers, we conduct a series of comparison experiments on 21 deep learning models.", "The experiment includes direct classification, imbalanced training, and hyperparameter tuning experiments.", "During the experiments, we find complementarities among the 21 models, which is the basis for feature fusion related experiments.", "We also find that geometric-deformation data augmentation struggles to improve the performance of the VT (ViT, DeiT, BotNet and T2T-ViT) series models.", "In terms of model performance, Xception has the best classification performance, the ViT model requires the least training time, and the ShuffleNet-V2 model has the fewest parameters."]} {"id": "http://arxiv.org/abs/2108.02234", "title": "Multi-Branch with Attention Network for Hand-Based Person Recognition.", "authors": "Nathanael L. Baisa, Bryan Williams, Hossein Rahmani, Plamen Angelov, Sue Black", "abstract": "In this paper, we propose a novel hand-based person recognition method for the purpose of criminal investigations since the hand image is often the only available information in cases of serious crime such as sexual abuse. Our proposed method, Multi-Branch with Attention Network (MBA-Net), incorporates both channel and spatial attention modules in branches in addition to a global (without attention) branch to capture global structural information for discriminative feature learning. The attention modules focus on the relevant features of the hand image while suppressing the irrelevant backgrounds. In order to overcome the weakness of the attention mechanisms, equivariant to pixel shuffling, we integrate relative positional encodings into the spatial attention module to capture the spatial positions of pixels.
Extensive evaluations on two large multi-ethnic and publicly available hand datasets demonstrate that our proposed method achieves state-of-the-art performance, surpassing the existing hand-based identification methods.", "sentences": ["Multi-Branch with Attention Network for Hand-Based Person Recognition.", "In this paper, we propose a novel hand-based person recognition method for the purpose of criminal investigations since the hand image is often the only available information in cases of serious crime such as sexual abuse.", "Our proposed method, Multi-Branch with Attention Network (MBA-Net), incorporates both channel and spatial attention modules in branches in addition to a global (without attention) branch to capture global structural information for discriminative feature learning.", "The attention modules focus on the relevant features of the hand image while suppressing the irrelevant backgrounds.", "In order to overcome the weakness of the attention mechanisms, equivariant to pixel shuffling, we integrate relative positional encodings into the spatial attention module to capture the spatial positions of pixels.", "Extensive evaluations on two large multi-ethnic and publicly available hand datasets demonstrate that our proposed method achieves state-of-the-art performance, surpassing the existing hand-based identification methods."]} {"id": "http://arxiv.org/abs/2108.10869", "title": "DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras.", "authors": "Zachary Teed, Jia Deng", "abstract": "We introduce DROID-SLAM, a new deep learning based SLAM system. DROID-SLAM consists of recurrent iterative updates of camera pose and pixelwise depth through a Dense Bundle Adjustment layer. DROID-SLAM is accurate, achieving large improvements over prior work, and robust, suffering from substantially fewer catastrophic failures. Despite training on monocular video, it can leverage stereo or RGB-D video to achieve improved performance at test time. The URL to our open source code is https://github.com/princeton-vl/DROID-SLAM.", "sentences": ["DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras.", "We introduce DROID-SLAM, a new deep learning based SLAM system.", "DROID-SLAM consists of recurrent iterative updates of camera pose and pixelwise depth through a Dense Bundle Adjustment layer.", "DROID-SLAM is accurate, achieving large improvements over prior work, and robust, suffering from substantially fewer catastrophic failures.", "Despite training on monocular video, it can leverage stereo or RGB-D video to achieve improved performance at test time.", "The URL to our open source code is https://github.com/princeton-vl/DROID-SLAM."]} {"id": "http://arxiv.org/abs/2109.02618", "title": "Bridging the Gap between Events and Frames through Unsupervised Domain Adaptation.", "authors": "Nico Messikommer, Daniel Gehrig, Mathias Gehrig, Davide Scaramuzza", "abstract": "Reliable perception during fast motion maneuvers or in high dynamic range environments is crucial for robotic systems. Since event cameras are robust to these challenging conditions, they have great potential to increase the reliability of robot vision. However, event-based vision has been held back by the shortage of labeled datasets due to the novelty of event cameras. To overcome this drawback, we propose a task transfer method to train models directly with labeled images and unlabeled event data. 
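The MBA-Net record above combines channel- and spatial-attention branches with a global branch. Generic versions of such modules might look like the sketch below; the squeeze-excitation-style channel module and the 7x7 spatial convolution are common conventions, not the paper's exact design (which additionally adds relative positional encodings).

```python
# Generic channel- and spatial-attention modules of the kind combined
# in MBA-Net's branches; exact designs here are assumptions.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))          # squeeze -> (B, C)
        return x * w[:, :, None, None]           # reweight channels

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True),
                       x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))   # reweight locations

x = torch.randn(2, 64, 16, 8)
out = SpatialAttention()(ChannelAttention(64)(x))
print(out.shape)   # torch.Size([2, 64, 16, 8])
```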
Compared to previous approaches, (i) our method transfers from single images to events instead of high frame rate videos, and (ii) does not rely on paired sensor data. To achieve this, we leverage the generative event model to split event features into content and motion features. This split enables efficient matching between latent spaces for events and images, which is crucial for successful task transfer. Thus, our approach unlocks the vast amount of existing image datasets for the training of event-based neural networks. Our task transfer method consistently outperforms methods targeting Unsupervised Domain Adaptation for object detection by 0.26 mAP (increase by 93%) and classification by 2.7% accuracy.", "sentences": ["Bridging the Gap between Events and Frames through Unsupervised Domain Adaptation.", "Reliable perception during fast motion maneuvers or in high dynamic range environments is crucial for robotic systems.", "Since event cameras are robust to these challenging conditions, they have great potential to increase the reliability of robot vision.", "However, event-based vision has been held back by the shortage of labeled datasets due to the novelty of event cameras.", "To overcome this drawback, we propose a task transfer method to train models directly with labeled images and unlabeled event data.", "Compared to previous approaches, (i) our method transfers from single images to events instead of high frame rate videos, and (ii) does not rely on paired sensor data.", "To achieve this, we leverage the generative event model to split event features into content and motion features.", "This split enables efficient matching between latent spaces for events and images, which is crucial for successful task transfer.", "Thus, our approach unlocks the vast amount of existing image datasets for the training of event-based neural networks.", "Our task transfer method consistently outperforms methods targeting Unsupervised Domain Adaptation for object detection by 0.26 mAP (increase by 93%) and classification by 2.7% accuracy."]} {"id": "http://arxiv.org/abs/2109.05759", "title": "Global-Local Dynamic Feature Alignment Network for Person Re-Identification.", "authors": "Zhangqiang Ming, Yong Yang, Xiaoyong Wei, Jianrong Yan, Xiangkun Wang, Fengjie Wang, Min Zhu", "abstract": "The misalignment of human images caused by bounding box detection errors or partial occlusions is one of the main challenges in person Re-Identification (Re-ID) tasks. Previous local-based methods mainly focus on learning local features in predefined semantic regions of pedestrians. These methods usually use local hard alignment methods or introduce auxiliary information such as key human pose points to match local features, which are often not applicable when large scene differences are encountered. To solve these problems, we propose a simple and efficient Local Sliding Alignment (LSA) strategy to dynamically align the local features of two images by setting a sliding window on the local stripes of the pedestrian. LSA can effectively suppress spatial misalignment and does not need to introduce extra supervision information. Then, we design a Global-Local Dynamic Feature Alignment Network (GLDFA-Net) framework, which contains both global and local branches. We introduce LSA into the local branch of GLDFA-Net to guide the computation of distance metrics, which can further improve the accuracy of the testing phase. 
Experiments on several mainstream evaluation datasets including Market-1501, DukeMTMC-reID, CUHK03 and MSMT17 show that our method achieves accuracy competitive with several state-of-the-art person Re-ID methods. Specifically, it achieves 86.1% mAP and 94.8% Rank-1 accuracy on Market-1501.", "sentences": ["Global-Local Dynamic Feature Alignment Network for Person Re-Identification.", "The misalignment of human images caused by bounding box detection errors or partial occlusions is one of the main challenges in person Re-Identification (Re-ID) tasks.", "Previous local-based methods mainly focus on learning local features in predefined semantic regions of pedestrians.", "These methods usually use local hard alignment methods or introduce auxiliary information such as key human pose points to match local features, which are often not applicable when large scene differences are encountered.", "To solve these problems, we propose a simple and efficient Local Sliding Alignment (LSA) strategy to dynamically align the local features of two images by setting a sliding window on the local stripes of the pedestrian.", "LSA can effectively suppress spatial misalignment and does not need to introduce extra supervision information.", "Then, we design a Global-Local Dynamic Feature Alignment Network (GLDFA-Net) framework, which contains both global and local branches.", "We introduce LSA into the local branch of GLDFA-Net to guide the computation of distance metrics, which can further improve the accuracy of the testing phase.", "Experiments on several mainstream evaluation datasets including Market-1501, DukeMTMC-reID, CUHK03 and MSMT17 show that our method achieves accuracy competitive with several state-of-the-art person Re-ID methods.", "Specifically, it achieves 86.1% mAP and 94.8% Rank-1 accuracy on Market-1501."]} {"id": "http://arxiv.org/abs/2110.09348", "title": "Understanding Dimensional Collapse in Contrastive Self-supervised Learning.", "authors": "Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian", "abstract": "Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. The joint embedding approach is based on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem where all embedding vectors collapse to a trivial constant solution. Among these methods, contrastive learning prevents collapse via negative sample pairs. It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space. Here, we show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that lead to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector.
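The Local Sliding Alignment (LSA) strategy in the GLDFA-Net record above can be pictured as a windowed stripe-matching distance. A sketch under assumed stripe features and window size:

```python
# Sketch of a sliding-window stripe alignment distance: each horizontal
# stripe feature of one image is matched to the best stripe within a
# small window of the other image (window size is an assumed setting).
import torch
import torch.nn.functional as F

def lsa_distance(f_a, f_b, window=1):
    """f_a, f_b: (S, D) L2-normalized stripe features of two images."""
    S = f_a.size(0)
    total = 0.0
    for i in range(S):
        lo, hi = max(0, i - window), min(S, i + window + 1)
        d = 1 - f_a[i] @ f_b[lo:hi].t()   # cosine distances in the window
        total = total + d.min()           # keep the best local match
    return total / S                      # misalignment-tolerant distance

f_a = F.normalize(torch.randn(6, 256), dim=1)
f_b = F.normalize(torch.randn(6, 256), dim=1)
print(lsa_distance(f_a, f_b).item())
```

Because each stripe searches only a small vertical neighborhood, the measure tolerates the bounding-box shifts the record describes without introducing any extra supervision.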
Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.", "sentences": ["Understanding Dimensional Collapse in Contrastive Self-supervised Learning.", "Self-supervised visual representation learning aims to learn useful representations without relying on human annotations.", "The joint embedding approach is based on maximizing the agreement between embedding vectors from different views of the same image.", "Various methods have been proposed to solve the collapsing problem where all embedding vectors collapse to a trivial constant solution.", "Among these methods, contrastive learning prevents collapse via negative sample pairs.", "It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space.", "Here, we show that dimensional collapse also happens in contrastive learning.", "In this paper, we shed light on the dynamics at play in contrastive learning that lead to dimensional collapse.", "Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector.", "Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet."]} {"id": "http://arxiv.org/abs/2111.01134", "title": "Comparing Bayesian Models for Organ Contouring in Head and Neck Radiotherapy.", "authors": "Prerak Mody, Nicolas Chaves-de-Plaza, Klaus Hildebrandt, Rene van Egmond, Huib de Ridder, Marius Staring", "abstract": "Deep learning models for organ contouring in radiotherapy are poised for clinical usage, but currently, there exist few tools for automated quality assessment (QA) of the predicted contours. Using Bayesian models and their associated uncertainty, one can potentially automate the process of detecting inaccurate predictions. We investigate two Bayesian models for auto-contouring, DropOut and FlipOut, using a quantitative measure - expected calibration error (ECE) and a qualitative measure - region-based accuracy-vs-uncertainty (R-AvU) graphs. It is well understood that a model should have low ECE to be considered trustworthy. However, in a QA context, a model should also have high uncertainty in inaccurate regions and low uncertainty in accurate regions. Such behaviour could direct visual attention of expert users to potentially inaccurate regions, leading to a speed up in the QA process. Using R-AvU graphs, we qualitatively compare the behaviour of different models in accurate and inaccurate regions. Experiments are conducted on the MICCAI2015 Head and Neck Segmentation Challenge and on the DeepMindTCIA CT dataset using three models: DropOut-DICE, Dropout-CE (Cross Entropy) and FlipOut-CE. Quantitative results show that DropOut-DICE has the highest ECE, while Dropout-CE and FlipOut-CE have the lowest ECE. To better understand the difference between DropOut-CE and FlipOut-CE, we use the R-AvU graph which shows that FlipOut-CE has better uncertainty coverage in inaccurate regions than DropOut-CE.
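The DirectCLR record above applies the contrastive loss directly to the representation, with no trainable projector. A sketch of that idea follows, together with a common singular-spectrum diagnostic for dimensional collapse; the subvector size d0 and the temperature are illustrative settings.

```python
# Sketch of the DirectCLR idea: InfoNCE applied to the first d0
# components of the backbone representation, with no trainable projector.
import torch
import torch.nn.functional as F

def directclr_loss(r1, r2, d0=64, temp=0.1):
    """r1, r2: (B, D) backbone representations of two augmented views."""
    z1 = F.normalize(r1[:, :d0], dim=1)   # the "projector" is slicing
    z2 = F.normalize(r2[:, :d0], dim=1)
    logits = z1 @ z2.t() / temp           # (B, B) similarities
    target = torch.arange(r1.size(0))     # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, target)
                  + F.cross_entropy(logits.t(), target))

# A quick diagnostic for dimensional collapse: a fast-decaying singular
# spectrum of the (centered) embedding matrix indicates collapse.
r = torch.randn(512, 128) @ torch.diag(torch.linspace(1.0, 0.0, 128))
print(torch.linalg.svdvals(r - r.mean(0))[:5])
```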
Such a combination of quantitative and qualitative metrics offers a new approach that helps to select which model can be deployed as a QA tool in clinical settings.", "sentences": ["Comparing Bayesian Models for Organ Contouring in Head and Neck Radiotherapy.", "Deep learning models for organ contouring in radiotherapy are poised for clinical usage, but currently, there exist few tools for automated quality assessment (QA) of the predicted contours.", "Using Bayesian models and their associated uncertainty, one can potentially automate the process of detecting inaccurate predictions.", "We investigate two Bayesian models for auto-contouring, DropOut and FlipOut, using a quantitative measure - expected calibration error (ECE) and a qualitative measure - region-based accuracy-vs-uncertainty (R-AvU) graphs.", "It is well understood that a model should have low ECE to be considered trustworthy.", "However, in a QA context, a model should also have high uncertainty in inaccurate regions and low uncertainty in accurate regions.", "Such behaviour could direct visual attention of expert users to potentially inaccurate regions, leading to a speed up in the QA process.", "Using R-AvU graphs, we qualitatively compare the behaviour of different models in accurate and inaccurate regions.", "Experiments are conducted on the MICCAI2015 Head and Neck Segmentation Challenge and on the DeepMindTCIA CT dataset using three models: DropOut-DICE, Dropout-CE (Cross Entropy) and FlipOut-CE.", "Quantitative results show that DropOut-DICE has the highest ECE, while Dropout-CE and FlipOut-CE have the lowest ECE.", "To better understand the difference between DropOut-CE and FlipOut-CE, we use the R-AvU graph which shows that FlipOut-CE has better uncertainty coverage in inaccurate regions than DropOut-CE.", "Such a combination of quantitative and qualitative metrics offers a new approach that helps to select which model can be deployed as a QA tool in clinical settings."]} {"id": "http://arxiv.org/abs/2111.15208", "title": "HRNET: AI on Edge for mask detection and social distancing.", "authors": "Kinshuk Sengupta, Praveen Ranjan Srivastava", "abstract": "The purpose of this paper is to provide an emerging-technology framework that helps communities combat epidemic situations. The paper proposes a unique outbreak response system framework based on artificial intelligence and edge computing for citizen-centric services, helping to track and trace people who elude safety policies, via mask detection and social-distancing measurement in public and workplace settings. The framework further provides implementation guidelines for industrial settings as well as for governance and contact-tracing tasks. Its adoption can thus inform smart-city planning and development focused on citizen health systems, contributing to improved quality of life. The conceptual framework is validated through quantitative analysis of secondary data collected from researchers' public websites, GitHub repositories and established journals, and further benchmarking was conducted in a Microsoft Azure cloud environment. The study benchmarks selected AI models, assessing their performance and accuracy in an edge-computing environment for large-scale societal deployment. Overall, the YOLO model performs best on the object detection task and is fast enough for mask detection, while HRNetV2 performs best on the semantic segmentation problem used to solve the social-distancing task in the AI-edge inference setup.
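Expected calibration error, the quantitative measure used in the organ-contouring record above, can be computed with the usual equal-width binning convention; 10 bins here is an assumed, common default.

```python
# Sketch of expected calibration error (ECE) with 10 equal-width bins.
import numpy as np

def ece(confidences, correct, n_bins=10):
    """confidences: predicted probabilities in [0, 1]; correct: 0/1."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            total += in_bin.mean() * gap   # weight gap by bin frequency
    return total

print(ece([0.9, 0.8, 0.6, 0.55], [1, 1, 0, 1]))
```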
The paper proposes a new edge-AI algorithm for building technology-oriented solutions for detecting masks on people in motion and measuring social distance. The paper advances the application of artificial intelligence and edge computing to problems in society and healthcare systems. The framework further equips government agencies and system providers to design and construct technology-oriented models in community settings, increasing quality of life by bringing emerging technologies into smart urban environments.", "sentences": ["HRNET: AI on Edge for mask detection and social distancing.", "The purpose of this paper is to provide an emerging-technology framework that helps communities combat epidemic situations.", "The paper proposes a unique outbreak response system framework based on artificial intelligence and edge computing for citizen-centric services, helping to track and trace people who elude safety policies, via mask detection and social-distancing measurement in public and workplace settings.", "The framework further provides implementation guidelines for industrial settings as well as for governance and contact-tracing tasks.", "Its adoption can thus inform smart-city planning and development focused on citizen health systems, contributing to improved quality of life.", "The conceptual framework is validated through quantitative analysis of secondary data collected from researchers' public websites, GitHub repositories and established journals, and further benchmarking was conducted in a Microsoft Azure cloud environment.", "The study benchmarks selected AI models, assessing their performance and accuracy in an edge-computing environment for large-scale societal deployment.", "Overall, the YOLO model performs best on the object detection task and is fast enough for mask detection, while HRNetV2 performs best on the semantic segmentation problem used to solve the social-distancing task in the AI-edge inference setup.", "The paper proposes a new edge-AI algorithm for building technology-oriented solutions for detecting masks on people in motion and measuring social distance.", "The paper advances the application of artificial intelligence and edge computing to problems in society and healthcare systems.", "The framework further equips government agencies and system providers to design and construct technology-oriented models in community settings, increasing quality of life by bringing emerging technologies into smart urban environments."]} {"id": "http://arxiv.org/abs/2112.05290", "title": "Image-to-Image Translation-based Data Augmentation for Robust EV Charging Inlet Detection.", "authors": "Yeonjun Bang, Yeejin Lee, Byeongkeun Kang", "abstract": "This work addresses the task of electric vehicle (EV) charging inlet detection for autonomous EV charging robots. Recently, automated EV charging systems have received huge attention to improve users' experience and to efficiently utilize charging infrastructures and parking lots. However, most related works have focused on system design, robot control, planning, and manipulation. Towards robust EV charging inlet detection, we propose a new dataset (EVCI dataset) and a novel data augmentation method that is based on image-to-image translation where typical image-to-image translation methods synthesize a new image in a different domain given an image. To the best of our knowledge, the EVCI dataset is the first EV charging inlet dataset.
For the data augmentation method, we focus on being able to control synthesized images' captured environments (e.g., time, lighting) in an intuitive way. To achieve this, we first propose the environment guide vector, which humans can intuitively interpret. We then propose a novel image-to-image translation network that translates a given image towards the environment described by the vector. Accordingly, it aims to synthesize a new image that has the same content as the given image while appearing to have been captured in the environment specified by the environment guide vector. Lastly, we train a detection method using the augmented dataset. Through experiments on the EVCI dataset, we demonstrate that the proposed method outperforms the state-of-the-art methods. We also show that the proposed method is able to control synthesized images using an input image and environment guide vectors.", "sentences": ["Image-to-Image Translation-based Data Augmentation for Robust EV Charging Inlet Detection.", "This work addresses the task of electric vehicle (EV) charging inlet detection for autonomous EV charging robots.", "Recently, automated EV charging systems have received huge attention to improve users' experience and to efficiently utilize charging infrastructures and parking lots.", "However, most related works have focused on system design, robot control, planning, and manipulation.", "Towards robust EV charging inlet detection, we propose a new dataset (EVCI dataset) and a novel data augmentation method that is based on image-to-image translation where typical image-to-image translation methods synthesize a new image in a different domain given an image.", "To the best of our knowledge, the EVCI dataset is the first EV charging inlet dataset.", "For the data augmentation method, we focus on being able to control synthesized images' captured environments (e.g., time, lighting) in an intuitive way.", "To achieve this, we first propose the environment guide vector, which humans can intuitively interpret.", "We then propose a novel image-to-image translation network that translates a given image towards the environment described by the vector.", "Accordingly, it aims to synthesize a new image that has the same content as the given image while appearing to have been captured in the environment specified by the environment guide vector.", "Lastly, we train a detection method using the augmented dataset.", "Through experiments on the EVCI dataset, we demonstrate that the proposed method outperforms the state-of-the-art methods.", "We also show that the proposed method is able to control synthesized images using an input image and environment guide vectors."]} {"id": "http://arxiv.org/abs/2112.06809", "title": "Persistent Object Identification Leveraging Non-Visual Markers.", "authors": "Michael P. J. Camilleri, Li Zhang, Rasneer S. Bains, Andrew Zisserman, Christopher K. I. Williams", "abstract": "Our objective is to locate and provide a unique identifier for each mouse in a cluttered home-cage environment through time, as a precursor to automated behaviour recognition for biological research. This is a very challenging problem due to (i) the lack of distinguishing visual features for each mouse, and (ii) the close confines of the scene with constant occlusion, making standard visual tracking approaches unusable. However, a coarse estimate of each mouse's location is available from a unique RFID implant, so there is the potential to optimally combine information from (weak) tracking with coarse information on identity.
To achieve our objective, we make the following key contributions: (a) the formulation of the object identification problem as an assignment problem (solved using Integer Linear Programming), and (b) a novel probabilistic model of the affinity between tracklets and RFID data. The latter is a crucial part of the model, as it provides a principled probabilistic treatment of object detections given coarse localisation. Our approach achieves 77% accuracy on this animal identification problem, and is able to reject spurious detections when the animals are hidden.", "sentences": ["Persistent Object Identification Leveraging Non-Visual Markers.", "Our objective is to locate and provide a unique identifier for each mouse in a cluttered home-cage environment through time, as a precursor to automated behaviour recognition for biological research.", "This is a very challenging problem due to (i) the lack of distinguishing visual features for each mouse, and (ii) the close confines of the scene with constant occlusion, making standard visual tracking approaches unusable.", "However, a coarse estimate of each mouse's location is available from a unique RFID implant, so there is the potential to optimally combine information from (weak) tracking with coarse information on identity.", "To achieve our objective, we make the following key contributions: (a) the formulation of the object identification problem as an assignment problem (solved using Integer Linear Programming), and (b) a novel probabilistic model of the affinity between tracklets and RFID data.", "The latter is a crucial part of the model, as it provides a principled probabilistic treatment of object detections given coarse localisation.", "Our approach achieves 77% accuracy on this animal identification problem, and is able to reject spurious detections when the animals are hidden."]} {"id": "http://arxiv.org/abs/2112.11641", "title": "JoJoGAN: One Shot Face Stylization.", "authors": "Min Jin Chong, David Forsyth", "abstract": "A style mapper applies some fixed style to its input images (so, for example, taking faces to cartoons). This paper describes a simple procedure -- JoJoGAN -- to learn a style mapper from a single example of the style. JoJoGAN uses a GAN inversion procedure and StyleGAN's style-mixing property to produce a substantial paired dataset from a single example style. The paired dataset is then used to fine-tune a StyleGAN. An image can then be style mapped by GAN-inversion followed by the fine-tuned StyleGAN. JoJoGAN needs just one reference and as little as 30 seconds of training time. JoJoGAN can use extreme style references (say, animal faces) successfully. Furthermore, one can control what aspects of the style are used and how much of the style is applied. 
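The mouse-identification record above casts tracklet-to-RFID matching as an assignment problem solved with Integer Linear Programming. For a one-to-one snapshot this reduces to the Hungarian algorithm, sketched here with an affinity threshold for rejecting spurious detections (the threshold value is illustrative).

```python
# Sketch of tracklet-to-RFID assignment from an affinity matrix; the
# Hungarian algorithm stands in for the paper's ILP formulation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_identities(affinity, min_affinity=0.2):
    """affinity[i, j]: affinity of tracklet i to RFID identity j."""
    rows, cols = linear_sum_assignment(-affinity)   # maximize affinity
    return {int(i): int(j) for i, j in zip(rows, cols)
            if affinity[i, j] >= min_affinity}      # reject weak matches

aff = np.array([[0.90, 0.05, 0.05],
                [0.10, 0.80, 0.10],
                [0.30, 0.30, 0.05]])   # tracklet 2: likely spurious
print(assign_identities(aff))          # {0: 0, 1: 1}; tracklet 2 rejected
```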
Qualitative and quantitative evaluation show that JoJoGAN produces high quality high resolution images that vastly outperform the current state-of-the-art.", "sentences": ["JoJoGAN: One Shot Face Stylization.", "A style mapper applies some fixed style to its input images (so, for example, taking faces to cartoons).", "This paper describes a simple procedure -- JoJoGAN -- to learn a style mapper from a single example of the style.", "JoJoGAN uses a GAN inversion procedure and StyleGAN's style-mixing property to produce a substantial paired dataset from a single example style.", "The paired dataset is then used to fine-tune a StyleGAN.", "An image can then be style mapped by GAN-inversion followed by the fine-tuned StyleGAN.", "JoJoGAN needs just one reference and as little as 30 seconds of training time.", "JoJoGAN can use extreme style references (say, animal faces) successfully.", "Furthermore, one can control what aspects of the style are used and how much of the style is applied.", "Qualitative and quantitative evaluation show that JoJoGAN produces high quality high resolution images that vastly outperform the current state-of-the-art."]} {"id": "http://arxiv.org/abs/2201.12733", "title": "TPC: Transformation-Specific Smoothing for Point Cloud Models.", "authors": "Wenda Chu, Linyi Li, Bo Li", "abstract": "Point cloud models with neural network architectures have achieved great success and have been widely used in safety-critical applications, such as Lidar-based recognition systems in autonomous vehicles. However, such models are shown vulnerable against adversarial attacks which aim to apply stealthy semantic transformations such as rotation and tapering to mislead model predictions. In this paper, we propose a transformation-specific smoothing framework TPC, which provides tight and scalable robustness guarantees for point cloud models against semantic transformation attacks. We first categorize common 3D transformations into three categories: additive (e.g., shearing), composable (e.g., rotation), and indirectly composable (e.g., tapering), and we present generic robustness certification strategies for all categories respectively. We then specify unique certification protocols for a range of specific semantic transformations and their compositions. Extensive experiments on several common 3D transformations show that TPC significantly outperforms the state of the art. 
For example, our framework boosts the certified accuracy against twisting transformation along z-axis (within 20$^\\circ$) from 20.3$\\%$ to 83.8$\\%$.", "sentences": ["TPC: Transformation-Specific Smoothing for Point Cloud Models.", "Point cloud models with neural network architectures have achieved great success and have been widely used in safety-critical applications, such as Lidar-based recognition systems in autonomous vehicles.", "However, such models are shown vulnerable against adversarial attacks which aim to apply stealthy semantic transformations such as rotation and tapering to mislead model predictions.", "In this paper, we propose a transformation-specific smoothing framework TPC, which provides tight and scalable robustness guarantees for point cloud models against semantic transformation attacks.", "We first categorize common 3D transformations into three categories: additive (e.g., shearing), composable (e.g., rotation), and indirectly composable (e.g., tapering), and we present generic robustness certification strategies for all categories respectively.", "We then specify unique certification protocols for a range of specific semantic transformations and their compositions.", "Extensive experiments on several common 3D transformations show that TPC significantly outperforms the state of the art.", "For example, our framework boosts the certified accuracy against twisting transformation along z-axis (within 20$^\\circ$) from 20.3$\\%$ to 83.8$\\%$."]} {"id": "http://arxiv.org/abs/2202.00448", "title": "Sim2Real Object-Centric Keypoint Detection and Description.", "authors": "Chengliang Zhong, Chao Yang, Jinshan Qi, Fuchun Sun, Huaping Liu, Xiaodong Mu, Wenbing Huang", "abstract": "Keypoint detection and description play a central role in computer vision. Most existing methods are in the form of scene-level prediction, without returning the object classes of different keypoints. In this paper, we propose the object-centric formulation, which, beyond the conventional setting, requires further identifying which object each interest point belongs to. With such fine-grained information, our framework enables more downstream potentials, such as object-level matching and pose estimation in a clustered environment. To get around the difficulty of label collection in the real world, we develop a sim2real contrastive learning mechanism that can generalize the model trained in simulation to real-world applications. The novelties of our training method are three-fold: (i) we integrate the uncertainty into the learning framework to improve feature description of hard cases, e.g., less-textured or symmetric patches; (ii) we decouple the object descriptor into two output branches -- intra-object salience and inter-object distinctness, resulting in a better pixel-wise description; (iii) we enforce cross-view semantic consistency for enhanced robustness in representation learning. Comprehensive experiments on image matching and 6D pose estimation verify the encouraging generalization ability of our method from simulation to reality. Particularly for 6D pose estimation, our method significantly outperforms typical unsupervised/sim2real methods, achieving a closer gap with the fully supervised counterpart. 
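Prediction under transformation-specific smoothing of the kind TPC certifies can be approximated by majority-voting a base classifier over randomly transformed copies of the point cloud. A sketch for z-axis rotation follows; the noise scale, vote count, and toy classifier are assumptions, and TPC's actual contribution, the robustness certificate, is omitted here.

```python
# Sketch of smoothed prediction for a composable transform (z-rotation):
# majority vote over randomly rotated copies of the point cloud.
import numpy as np

def rotate_z(points, theta):
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ R.T

def smoothed_predict(classifier, points, sigma=0.25, n=100, seed=0):
    rng = np.random.default_rng(seed)
    votes = [classifier(rotate_z(points, rng.normal(0.0, sigma)))
             for _ in range(n)]
    return int(np.bincount(votes).argmax())

clf = lambda pts: int(pts[:, 0].mean() > 0)   # toy stand-in classifier
cloud = np.random.default_rng(1).normal(size=(1024, 3))
print(smoothed_predict(clf, cloud))
```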
Additional results and videos can be found at https://zhongcl-thu.github.io/rock/", "sentences": ["Sim2Real Object-Centric Keypoint Detection and Description.", "Keypoint detection and description play a central role in computer vision.", "Most existing methods are in the form of scene-level prediction, without returning the object classes of different keypoints.", "In this paper, we propose the object-centric formulation, which, beyond the conventional setting, requires further identifying which object each interest point belongs to.", "With such fine-grained information, our framework enables more downstream potentials, such as object-level matching and pose estimation in a clustered environment.", "To get around the difficulty of label collection in the real world, we develop a sim2real contrastive learning mechanism that can generalize the model trained in simulation to real-world applications.", "The novelties of our training method are three-fold: (i) we integrate the uncertainty into the learning framework to improve feature description of hard cases, e.g., less-textured or symmetric patches; (ii) we decouple the object descriptor into two output branches -- intra-object salience and inter-object distinctness, resulting in a better pixel-wise description; (iii) we enforce cross-view semantic consistency for enhanced robustness in representation learning.", "Comprehensive experiments on image matching and 6D pose estimation verify the encouraging generalization ability of our method from simulation to reality.", "Particularly for 6D pose estimation, our method significantly outperforms typical unsupervised/sim2real methods, achieving a closer gap with the fully supervised counterpart.", "Additional results and videos can be found at https://zhongcl-thu.github.io/rock/"]} {"id": "http://arxiv.org/abs/2202.00677", "title": "An Embarrassingly Simple Consistency Regularization Method for Semi-Supervised Medical Image Segmentation.", "authors": "Hritam Basak, Rajarshi Bhattacharya, Rukhshanda Hussain, Agniv Chatterjee", "abstract": "The scarcity of pixel-level annotation is a prevalent problem in medical image segmentation tasks. In this paper, we introduce a novel regularization strategy involving interpolation-based mixing for semi-supervised medical image segmentation. The proposed method is a new consistency regularization strategy that encourages the segmentation of an interpolation of two unlabelled samples to be consistent with the interpolation of the segmentation maps of those samples. This method represents a specific type of data-adaptive regularization paradigm which helps to minimize overfitting to labelled data under high confidence values. The proposed method is advantageous over adversarial and generative models as it requires no additional computation. Upon evaluation on two publicly available MRI datasets: ACDC and MMWHS, experimental results demonstrate the superiority of the proposed method in comparison to existing semi-supervised models.
Code is available at: https://github.com/hritam-98/ICT-MedSeg", "sentences": ["An Embarrassingly Simple Consistency Regularization Method for Semi-Supervised Medical Image Segmentation.", "The scarcity of pixel-level annotation is a prevalent problem in medical image segmentation tasks.", "In this paper, we introduce a novel regularization strategy involving interpolation-based mixing for semi-supervised medical image segmentation.", "The proposed method is a new consistency regularization strategy that encourages the segmentation of an interpolation of two unlabelled samples to be consistent with the interpolation of the segmentation maps of those samples.", "This method represents a specific type of data-adaptive regularization paradigm which helps to minimize overfitting to labelled data under high confidence values.", "The proposed method is advantageous over adversarial and generative models as it requires no additional computation.", "Upon evaluation on two publicly available MRI datasets: ACDC and MMWHS, experimental results demonstrate the superiority of the proposed method in comparison to existing semi-supervised models.", "Code is available at: https://github.com/hritam-98/ICT-MedSeg"]} {"id": "http://arxiv.org/abs/2202.01011", "title": "Auto-Transfer: Learning to Route Transferrable Representations.", "authors": "Keerthiram Murugesan, Vijay Sadashivaiah, Ronny Luss, Karthikeyan Shanmugam, Pin-Yu Chen, Amit Dhurandhar", "abstract": "Knowledge transfer between heterogeneous source and target networks and tasks has received a lot of attention in recent times as large amounts of quality labelled data can be difficult to obtain in many applications. Existing approaches typically constrain the target deep neural network (DNN) feature representations to be close to the source DNNs feature representations, which can be limiting. We, in this paper, propose a novel adversarial multi-armed bandit approach which automatically learns to route source representations to appropriate target representations following which they are combined in meaningful ways to produce accurate target models. We see upwards of 5% accuracy improvements compared with the state-of-the-art knowledge transfer methods on four benchmark (target) image datasets CUB200, Stanford Dogs, MIT67, and Stanford40 where the source dataset is ImageNet. We qualitatively analyze the goodness of our transfer scheme by showing individual examples of the important features our target network focuses on in different layers compared with the (closest) competitors.
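The consistency term in the semi-supervised segmentation record above is easy to state in code: the prediction on a mixed input should match the same mix of the individual predictions. A sketch with a Beta-sampled mixing coefficient and an MSE consistency penalty, both common conventions rather than necessarily the paper's exact choices:

```python
# Sketch of an interpolation consistency term for segmentation:
# seg(mix(x1, x2)) should match mix(seg(x1), seg(x2)).
import torch
import torch.nn.functional as F

def interp_consistency_loss(model, x1, x2, alpha=0.5):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    with torch.no_grad():                    # targets from current model
        p1 = model(x1).softmax(dim=1)
        p2 = model(x2).softmax(dim=1)
    p_mix = model(lam * x1 + (1 - lam) * x2).softmax(dim=1)
    return F.mse_loss(p_mix, lam * p1 + (1 - lam) * p2)

model = torch.nn.Conv2d(1, 2, 3, padding=1)  # stand-in segmenter
x1 = torch.rand(2, 1, 32, 32)
x2 = torch.rand(2, 1, 32, 32)
print(interp_consistency_loss(model, x1, x2).item())
```

No adversarial or generative machinery is involved, which is why the record can claim the regularizer adds essentially no extra computation.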
We also observe that our improvement over other methods is higher for smaller target datasets making it an effective tool for small data applications that may benefit from transfer learning.", "sentences": ["Auto-Transfer: Learning to Route Transferrable Representations.", "Knowledge transfer between heterogeneous source and target networks and tasks has received a lot of attention in recent times as large amounts of quality labelled data can be difficult to obtain in many applications.", "Existing approaches typically constrain the target deep neural network (DNN) feature representations to be close to the source DNNs feature representations, which can be limiting.", "We, in this paper, propose a novel adversarial multi-armed bandit approach which automatically learns to route source representations to appropriate target representations following which they are combined in meaningful ways to produce accurate target models.", "We see upwards of 5% accuracy improvements compared with the state-of-the-art knowledge transfer methods on four benchmark (target) image datasets CUB200, Stanford Dogs, MIT67, and Stanford40 where the source dataset is ImageNet.", "We qualitatively analyze the goodness of our transfer scheme by showing individual examples of the important features our target network focuses on in different layers compared with the (closest) competitors.", "We also observe that our improvement over other methods is higher for smaller target datasets making it an effective tool for small data applications that may benefit from transfer learning."]} {"id": "http://arxiv.org/abs/2202.01197", "title": "VOS: Learning What You Don't Know by Virtual Outlier Synthesis.", "authors": "Xuefeng Du, Zhaoning Wang, Mu Cai, Yixuan Li", "abstract": "Out-of-distribution (OOD) detection has received much attention lately due to its importance in the safe deployment of neural networks. One of the key challenges is that models lack supervision signals from unknown data, and as a result, can produce overconfident predictions on OOD data. Previous approaches rely on real outlier datasets for model regularization, which can be costly and sometimes infeasible to obtain in practice. In this paper, we present VOS, a novel framework for OOD detection by adaptively synthesizing virtual outliers that can meaningfully regularize the model's decision boundary during training. Specifically, VOS samples virtual outliers from the low-likelihood region of the class-conditional distribution estimated in the feature space. Alongside, we introduce a novel unknown-aware training objective, which contrastively shapes the uncertainty space between the ID data and synthesized outlier data. VOS achieves state-of-the-art performance on both object detection and image classification models, reducing the FPR95 by up to 7.87% compared to the previous best method. 
Code is available at https://github.com/deeplearning-wisc/vos.", "sentences": ["VOS: Learning What You Don't Know by Virtual Outlier Synthesis.", "Out-of-distribution (OOD) detection has received much attention lately due to its importance in the safe deployment of neural networks.", "One of the key challenges is that models lack supervision signals from unknown data, and as a result, can produce overconfident predictions on OOD data.", "Previous approaches rely on real outlier datasets for model regularization, which can be costly and sometimes infeasible to obtain in practice.", "In this paper, we present VOS, a novel framework for OOD detection by adaptively synthesizing virtual outliers that can meaningfully regularize the model's decision boundary during training.", "Specifically, VOS samples virtual outliers from the low-likelihood region of the class-conditional distribution estimated in the feature space.", "Alongside, we introduce a novel unknown-aware training objective, which contrastively shapes the uncertainty space between the ID data and synthesized outlier data.", "VOS achieves state-of-the-art performance on both object detection and image classification models, reducing the FPR95 by up to 7.87% compared to the previous best method.", "Code is available at https://github.com/deeplearning-wisc/vos."]} {"id": "http://arxiv.org/abs/2202.01116", "title": "Unpaired Image Super-Resolution with Optimal Transport Maps.", "authors": "Milena Gazdieva, Litu Rout, Alexander Korotin, Alexander Filippov, Evgeny Burnaev", "abstract": "Real-world image super-resolution (SR) tasks often do not have paired datasets, limiting the application of supervised techniques. As a result, the tasks are usually approached by unpaired techniques based on Generative Adversarial Networks (GANs) which yield complex training losses with several regularization terms such as content and identity losses. We theoretically investigate the optimization problems which arise in such models and find two surprising observations. First, the learned SR map is always an optimal transport (OT) map. Second, we empirically show that the learned map is biased, i.e., it may not actually transform the distribution of low-resolution images to high-resolution images. Inspired by these findings, we propose an algorithm for unpaired SR which learns an unbiased OT map for the perceptual transport cost. Unlike existing GAN-based alternatives, our algorithm has a simple optimization objective reducing the necessity to perform complex hyperparameter selection and use additional regularizations.
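The VOS record above samples virtual outliers from the low-likelihood region of a class-conditional Gaussian fitted in feature space. A minimal numpy sketch follows; the candidate count, covariance regularizer, and quantile cutoff are illustrative choices.

```python
# Sketch of virtual outlier synthesis: fit a class-conditional Gaussian
# in feature space and keep only the lowest-likelihood samples.
import numpy as np

def sample_virtual_outliers(feats, n_candidates=1000, keep_quantile=0.02):
    """feats: (N, D) penultimate-layer features of one class."""
    mu = feats.mean(0)
    cov = np.cov(feats.T) + 1e-4 * np.eye(feats.shape[1])
    rng = np.random.default_rng(0)
    cand = rng.multivariate_normal(mu, cov, size=n_candidates)
    # Gaussian log-density up to a constant; the lowest-density
    # candidates become the virtual outliers.
    prec = np.linalg.inv(cov)
    d = cand - mu
    logp = -0.5 * np.einsum('nd,de,ne->n', d, prec, d)
    cutoff = np.quantile(logp, keep_quantile)
    return cand[logp <= cutoff]

feats = np.random.default_rng(1).normal(size=(500, 8))
print(sample_virtual_outliers(feats).shape)   # ~20 low-likelihood samples
```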
At the same time, it provides nearly state-of-the-art performance on the large-scale unpaired AIM-19 dataset.", "sentences": ["Unpaired Image Super-Resolution with Optimal Transport Maps.", "Real-world image super-resolution (SR) tasks often do not have paired datasets, limiting the application of supervised techniques.", "As a result, the tasks are usually approached by unpaired techniques based on Generative Adversarial Networks (GANs) which yield complex training losses with several regularization terms such as content and identity losses.", "We theoretically investigate the optimization problems which arise in such models and find two surprising observations.", "First, the learned SR map is always an optimal transport (OT) map.", "Second, we empirically show that the learned map is biased, i.e., it may not actually transform the distribution of low-resolution images to high-resolution images.", "Inspired by these findings, we propose an algorithm for unpaired SR which learns an unbiased OT map for the perceptual transport cost.", "Unlike existing GAN-based alternatives, our algorithm has a simple optimization objective reducing the necessity to perform complex hyperparameter selection and use additional regularizations.", "At the same time, it provides nearly state-of-the-art performance on the large-scale unpaired AIM-19 dataset."]} {"id": "http://arxiv.org/abs/2202.01390", "title": "Exploring Sub-skeleton Trajectories for Interpretable Recognition of Sign Language.", "authors": "Joachim Gudmundsson, Martin P. Seybold, John Pfeifer", "abstract": "Recent advances in tracking sensors and pose estimation software enable smart systems to use trajectories of skeleton joint locations for supervised learning. We study the problem of accurately recognizing sign language words, which is key to narrowing the communication gap between hard and non-hard of hearing people. Our method explores a geometric feature space that we call `sub-skeleton' aspects of movement. We assess similarity of feature space trajectories using natural, speed invariant distance measures, which enables clear and insightful nearest neighbor classification. The simplicity and scalability of our basic method allows for immediate application in different data domains with little to no parameter tuning. We demonstrate the effectiveness of our basic method, and a boosted variation, with experiments on data from different application domains and tracking technologies.
Surprisingly, our simple methods improve sign recognition over recent, state-of-the-art approaches.", "sentences": ["Exploring Sub-skeleton Trajectories for Interpretable Recognition of Sign Language.", "Recent advances in tracking sensors and pose estimation software enable smart systems to use trajectories of skeleton joint locations for supervised learning.", "We study the problem of accurately recognizing sign language words, which is key to narrowing the communication gap between hard-of-hearing and hearing people.", "Our method explores a geometric feature space that we call `sub-skeleton' aspects of movement.", "We assess similarity of feature space trajectories using natural, speed invariant distance measures, which enables clear and insightful nearest neighbor classification.", "The simplicity and scalability of our basic method allow for immediate application in different data domains with little to no parameter tuning.", "We demonstrate the effectiveness of our basic method, and a boosted variation, with experiments on data from different application domains and tracking technologies.", "Surprisingly, our simple methods improve sign recognition over recent, state-of-the-art approaches."]} {"id": "http://arxiv.org/abs/2202.01721", "title": "Variance-Optimal Augmentation Logging for Counterfactual Evaluation in Contextual Bandits.", "authors": "Aaron David Tucker, Thorsten Joachims", "abstract": "Methods for offline A/B testing and counterfactual learning are seeing rapid adoption in search and recommender systems, since they allow efficient reuse of existing log data. However, there are fundamental limits to using existing log data alone, since the counterfactual estimators that are commonly used in these methods can have large bias and large variance when the logging policy is very different from the target policy being evaluated. To overcome this limitation, we explore the question of how to design data-gathering policies that most effectively augment an existing dataset of bandit feedback with additional observations for both learning and evaluation. To this effect, this paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem.
We explore multiple approaches to computing MVAL policies efficiently, and find that they can be substantially more effective in decreasing the variance of an estimator than na\\\"ive approaches.", "sentences": ["Variance-Optimal Augmentation Logging for Counterfactual Evaluation in Contextual Bandits.", "Methods for offline A/B testing and counterfactual learning are seeing rapid adoption in search and recommender systems, since they allow efficient reuse of existing log data.", "However, there are fundamental limits to using existing log data alone, since the counterfactual estimators that are commonly used in these methods can have large bias and large variance when the logging policy is very different from the target policy being evaluated.", "To overcome this limitation, we explore the question of how to design data-gathering policies that most effectively augment an existing dataset of bandit feedback with additional observations for both learning and evaluation.", "To this effect, this paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem.", "We explore multiple approaches to computing MVAL policies efficiently, and find that they can be substantially more effective in decreasing the variance of an estimator than na\\\"ive approaches."]} {"id": "http://arxiv.org/abs/2202.01210", "title": "Deep Layer-wise Networks Have Closed-Form Weights.", "authors": "Chieh Wu, Aria Masoomi, Arthur Gretton, Jennifer Dy", "abstract": "There is currently a debate within the neuroscience community over the likelihood of the brain performing backpropagation (BP). To better mimic the brain, training a network \\textit{one layer at a time} with only a \"single forward pass\" has been proposed as an alternative to bypass BP; we refer to these networks as \"layer-wise\" networks. We continue the work on layer-wise networks by answering two outstanding questions. 
First, $\\textit{do they have a closed-form solution?}$ Second, $\\textit{how do we know when to stop adding more layers?}$ This work proves that the Kernel Mean Embedding is the closed-form weight that achieves the network global optimum while driving these networks to converge towards a highly desirable kernel for classification; we call it the $\\textit{Neural Indicator Kernel}$.", "sentences": ["Deep Layer-wise Networks Have Closed-Form Weights.", "There is currently a debate within the neuroscience community over the likelihood of the brain performing backpropagation (BP).", "To better mimic the brain, training a network \\textit{one layer at a time} with only a \"single forward pass\" has been proposed as an alternative to bypass BP; we refer to these networks as \"layer-wise\" networks.", "We continue the work on layer-wise networks by answering two outstanding questions.", "First, $\\textit{do they have a closed-form solution?}$", "Second, $\\textit{how do we know when to stop adding more layers?}$", "This work proves that the Kernel Mean Embedding is the closed-form weight that achieves the network global optimum while driving these networks to converge towards a highly desirable kernel for classification; we call it the $\\textit{Neural Indicator Kernel}$."]} {"id": "http://arxiv.org/abs/2202.01211", "title": "A Flexible Clustering Pipeline for Mining Text Intentions.", "authors": "Xinyu Chen, Ian Beaver", "abstract": "Mining the latent intentions from large volumes of natural language inputs is a key step to help data analysts design and refine Intelligent Virtual Assistants (IVAs) for customer service and sales support. We created a flexible and scalable clustering pipeline within the Verint Intent Manager (VIM) that integrates the fine-tuning of language models, a high performing k-NN library and community detection techniques to help analysts quickly surface and organize relevant user intentions from conversational texts. The fine-tuning step is necessary because pre-trained language models cannot encode texts to efficiently surface particular clustering structures when the target texts are from an unseen domain or the clustering task is not topic detection. We describe the pipeline and demonstrate its performance using BERT on three real-world text mining tasks.
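The layer-wise result above centers on the Kernel Mean Embedding as the closed-form weight; as a loose, self-contained illustration of that object (the RBF kernel and the nearest-embedding classifier are assumptions for illustration, not the paper's construction):

import numpy as np

def rbf(x, y, sigma=1.0):
    # Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def kme(sample, query, sigma=1.0):
    # Empirical kernel mean embedding of `sample` evaluated at `query`:
    # mu_P(query) = (1/n) * sum_i k(x_i, query)
    return np.mean([rbf(x, query, sigma) for x in sample])

def classify(query, class_samples, sigma=1.0):
    # Assign `query` to the class whose empirical embedding scores it highest.
    # class_samples: dict mapping label -> list of feature vectors.
    return max(class_samples, key=lambda c: kme(class_samples[c], query, sigma))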
As deployed in the VIM application, this clustering pipeline produces high quality results, improving the performance of data analysts and reducing the time it takes to surface intentions from customer service data, thereby reducing the time it takes to build and deploy IVAs in new domains.", "sentences": ["A Flexible Clustering Pipeline for Mining Text Intentions.", "Mining the latent intentions from large volumes of natural language inputs is a key step to help data analysts design and refine Intelligent Virtual Assistants (IVAs) for customer service and sales support.", "We created a flexible and scalable clustering pipeline within the Verint Intent Manager (VIM) that integrates the fine-tuning of language models, a high performing k-NN library and community detection techniques to help analysts quickly surface and organize relevant user intentions from conversational texts.", "The fine-tuning step is necessary because pre-trained language models cannot encode texts to efficiently surface particular clustering structures when the target texts are from an unseen domain or the clustering task is not topic detection.", "We describe the pipeline and demonstrate its performance using BERT on three real-world text mining tasks.", "As deployed in the VIM application, this clustering pipeline produces high quality results, improving the performance of data analysts and reducing the time it takes to surface intentions from customer service data, thereby reducing the time it takes to build and deploy IVAs in new domains."]} {"id": "http://arxiv.org/abs/2202.01212", "title": "Training Semantic Descriptors for Image-Based Localization.", "authors": "Ibrahim Cinaroglu, Yalin Bastanlar", "abstract": "Vision based solutions for the localization of vehicles have become popular recently. We employ an image retrieval based visual localization approach. The database images are kept with GPS coordinates and the location of the retrieved database image serves as an approximate position of the query image. We show that localization can be performed via descriptors solely extracted from semantically segmented images. It is reliable especially when the environment is subjected to severe illumination and seasonal changes. Our experiments reveal that the localization performance of a semantic descriptor can increase up to the level of state-of-the-art RGB image based methods.", "sentences": ["Training Semantic Descriptors for Image-Based Localization.", "Vision based solutions for the localization of vehicles have become popular recently.", "We employ an image retrieval based visual localization approach.", "The database images are kept with GPS coordinates and the location of the retrieved database image serves as an approximate position of the query image.", "We show that localization can be performed via descriptors solely extracted from semantically segmented images.", "It is reliable especially when the environment is subjected to severe illumination and seasonal changes.", "Our experiments reveal that the localization performance of a semantic descriptor can increase up to the level of state-of-the-art RGB image based methods."]} {"id": "http://arxiv.org/abs/2202.01243", "title": "Parameters or Privacy: A Provable Tradeoff Between Overparameterization and Membership Inference.", "authors": "Jasper Tan, Blake Mason, Hamid Javadi, Richard G. Baraniuk", "abstract": "A surprising phenomenon in modern machine learning is the ability of a highly overparameterized model to generalize well (small error on the test data) even when it is trained to memorize the training data (zero error on the training data). This has led to an arms race towards increasingly overparameterized models (cf. deep learning). In this paper, we study an underexplored hidden cost of overparameterization: the fact that overparameterized models are more vulnerable to privacy attacks, in particular the membership inference attack that predicts the (potentially sensitive) examples used to train a model. We significantly extend the relatively few empirical results on this problem by theoretically proving for an overparameterized linear regression model with Gaussian data that the membership inference vulnerability increases with the number of parameters. Moreover, a range of empirical studies indicates that more complex, nonlinear models exhibit the same behavior.
Finally, we study different methods for mitigating such attacks in the overparameterized regime, such as noise addition and regularization, and conclude that simply reducing the parameters of an overparameterized model is an effective strategy to protect it from membership inference without greatly decreasing its generalization error.", "sentences": ["Parameters or Privacy: A Provable Tradeoff Between Overparameterization and Membership Inference.", "A surprising phenomenon in modern machine learning is the ability of a highly overparameterized model to generalize well (small error on the test data) even when it is trained to memorize the training data (zero error on the training data).", "This has led to an arms race towards increasingly overparameterized models (cf. deep learning).", "In this paper, we study an underexplored hidden cost of overparameterization: the fact that overparameterized models are more vulnerable to privacy attacks, in particular the membership inference attack that predicts the (potentially sensitive) examples used to train a model.", "We significantly extend the relatively few empirical results on this problem by theoretically proving for an overparameterized linear regression model with Gaussian data that the membership inference vulnerability increases with the number of parameters.", "Moreover, a range of empirical studies indicates that more complex, nonlinear models exhibit the same behavior.", "Finally, we study different methods for mitigating such attacks in the overparameterized regime, such as noise addition and regularization, and conclude that simply reducing the parameters of an overparameterized model is an effective strategy to protect it from membership inference without greatly decreasing its generalization error."]} {"id": "http://arxiv.org/abs/2202.01252", "title": "Speaker Normalization for Self-supervised Speech Emotion Recognition.", "authors": "Itai Gat, Hagai Aronowitz, Weizhong Zhu, Edmilson Morais, Ron Hoory", "abstract": "Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversary learning framework that learns a speech emotion recognition task while normalizing speaker characteristics from the feature representation.
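For orientation on the membership inference attack studied in the overparameterization abstract above, the canonical loss-thresholding attack is easy to state as a sketch (the function name and threshold calibration are hypothetical; the paper's linear-regression analysis is far more specific):

import numpy as np

def membership_inference(model, loss_fn, examples, threshold):
    # Predict 'member of the training set' whenever the model's loss on an
    # example falls below a threshold calibrated on known non-member data.
    losses = np.array([loss_fn(model, x, y) for x, y in examples])
    return losses < threshold  # True = predicted training member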
We demonstrate the efficacy of our method on both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.", "sentences": ["Speaker Normalization for Self-supervised Speech Emotion Recognition.", "Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases.", "Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics.", "These shortcuts usually harm a model's ability to generalize.", "To address this challenge, we propose a gradient-based adversary learning framework that learns a speech emotion recognition task while normalizing speaker characteristics from the feature representation.", "We demonstrate the efficacy of our method on both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset."]} {"id": "http://arxiv.org/abs/2202.01258", "title": "Accelerated Quality-Diversity for Robotics through Massive Parallelism.", "authors": "Bryan Lim, Maxime Allard, Luca Grillotti, Antoine Cully", "abstract": "Quality-Diversity (QD) algorithms are a well-known approach to generate large collections of diverse and high-quality policies. However, QD algorithms are also known to be data-inefficient, requiring large amounts of computational resources and are slow when used in practice for robotics tasks. Policy evaluations are already commonly performed in parallel to speed up QD algorithms but have limited capabilities on a single machine as most physics simulators run on CPUs. With recent advances in simulators that run on accelerators, thousands of evaluations can be performed in parallel on a single GPU/TPU. In this paper, we present QDax, an implementation of MAP-Elites which leverages massive parallelism on accelerators to make QD algorithms more accessible. We first demonstrate the improvements on the number of evaluations per second that parallelism using accelerated simulators can offer. More importantly, we show that QD algorithms are ideal candidates and can scale with massive parallelism to be run at interactive timescales. The increase in parallelism does not significantly affect the performance of QD algorithms, while reducing experiment runtimes by two orders of magnitude, turning days of computation into minutes.
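The speaker-normalization abstract above describes gradient-based adversarial learning that strips speaker information from the feature representation; a common building block for such setups (assumed here purely for illustration, the paper's exact mechanism may differ) is a gradient reversal layer in PyTorch:

import torch

class GradReverse(torch.autograd.Function):
    # Identity on the forward pass; scales the gradient by -lam on the
    # backward pass, so a speaker classifier attached after this layer
    # pushes the upstream encoder to *remove* speaker information.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)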
These results show that QD can now benefit from hardware acceleration, which contributed significantly to the rise of deep learning.", "sentences": ["Accelerated Quality-Diversity for Robotics through Massive Parallelism.", "Quality-Diversity (QD) algorithms are a well-known approach to generate large collections of diverse and high-quality policies.", "However, QD algorithms are also known to be data-inefficient, requiring large amounts of computational resources and are slow when used in practice for robotics tasks.", "Policy evaluations are already commonly performed in parallel to speed up QD algorithms but have limited capabilities on a single machine as most physics simulators run on CPUs.", "With recent advances in simulators that run on accelerators, thousands of evaluations can be performed in parallel on a single GPU/TPU.", "In this paper, we present QDax, an implementation of MAP-Elites which leverages massive parallelism on accelerators to make QD algorithms more accessible.", "We first demonstrate the improvements on the number of evaluations per second that parallelism using accelerated simulators can offer.", "More importantly, we show that QD algorithms are ideal candidates and can scale with massive parallelism to be run at interactive timescales.", "The increase in parallelism does not significantly affect the performance of QD algorithms, while reducing experiment runtimes by two orders of magnitude, turning days of computation into minutes.", "These results show that QD can now benefit from hardware acceleration, which contributed significantly to the rise of deep learning."]} {"id": "http://arxiv.org/abs/2202.01261", "title": "Efficient Memory Partitioning in Software Defined Hardware.", "authors": "Matthew Feldman, Tian Zhao, Kunle Olukotun", "abstract": "As programmers turn to software-defined hardware (SDH) to maintain a high level of productivity while programming hardware to run complex algorithms, heavy-lifting must be done by the compiler to automatically partition on-chip arrays. In this paper, we introduce an automatic memory partitioning system that can quickly compute more efficient partitioning schemes than prior systems. Our system employs a variety of resource-saving optimizations and an ML cost model to select the best partitioning scheme from an array of candidates.
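The QDax abstract above rests on running thousands of policy evaluations in parallel on one accelerator; the basic vectorization pattern (a toy fitness standing in for a physics rollout, and not QDax's actual code) looks like this in JAX:

import jax
import jax.numpy as jnp

def fitness(params, key):
    # Toy stand-in for one episode rollout; a real QD setup would run an
    # accelerated simulator here and also return behavior descriptors.
    noise = jax.random.normal(key, params.shape)
    return -jnp.sum((params + 0.1 * noise) ** 2)

# One jit-compiled call evaluates the whole population on the accelerator.
evaluate_population = jax.jit(jax.vmap(fitness))

population = jax.random.normal(jax.random.PRNGKey(0), (1024, 8))
keys = jax.random.split(jax.random.PRNGKey(1), 1024)
scores = evaluate_population(population, keys)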
We compared our system against various state-of-the-art SDH compilers and FPGAs on a variety of benchmarks and found that our system generates solutions that, on average, consume 40.3% fewer logic resources, 78.3% fewer FFs, 54.9% fewer Block RAMs (BRAMs), and 100% fewer DSPs.", "sentences": ["Efficient Memory Partitioning in Software Defined Hardware.", "As programmers turn to software-defined hardware (SDH) to maintain a high level of productivity while programming hardware to run complex algorithms, heavy-lifting must be done by the compiler to automatically partition on-chip arrays.", "In this paper, we introduce an automatic memory partitioning system that can quickly compute more efficient partitioning schemes than prior systems.", "Our system employs a variety of resource-saving optimizations and an ML cost model to select the best partitioning scheme from an array of candidates.", "We compared our system against various state-of-the-art SDH compilers and FPGAs on a variety of benchmarks and found that our system generates solutions that, on average, consume 40.3% fewer logic resources, 78.3% fewer FFs, 54.9% fewer Block RAMs (BRAMs), and 100% fewer DSPs."]} {"id": "http://arxiv.org/abs/2202.01263", "title": "NoisyMix: Boosting Robustness by Combining Data Augmentations, Stability Training, and Noise Injections.", "authors": "N. Benjamin Erichson, Soon Hoe Lim, Francisco Utrera, Winnie Xu, Ziang Cao, Michael W. Mahoney", "abstract": "For many real-world applications, obtaining stable and robust statistical performance is more important than simply achieving state-of-the-art predictive test accuracy, and thus robustness of neural networks is an increasingly important topic. Relatedly, data augmentation schemes have been shown to improve robustness with respect to input perturbations and domain shifts. Motivated by this, we introduce NoisyMix, a training scheme that combines data augmentations with stability training and noise injections to improve both model robustness and in-domain accuracy. This combination promotes models that are consistently more robust and that provide well-calibrated estimates of class membership probabilities. We demonstrate the benefits of NoisyMix on a range of benchmark datasets, including ImageNet-C, ImageNet-R, and ImageNet-P. 
Moreover, we provide theory to understand implicit regularization and robustness of NoisyMix.", "sentences": ["NoisyMix: Boosting Robustness by Combining Data Augmentations, Stability Training, and Noise Injections.", "For many real-world applications, obtaining stable and robust statistical performance is more important than simply achieving state-of-the-art predictive test accuracy, and thus robustness of neural networks is an increasingly important topic.", "Relatedly, data augmentation schemes have been shown to improve robustness with respect to input perturbations and domain shifts.", "Motivated by this, we introduce NoisyMix, a training scheme that combines data augmentations with stability training and noise injections to improve both model robustness and in-domain accuracy.", "This combination promotes models that are consistently more robust and that provide well-calibrated estimates of class membership probabilities.", "We demonstrate the benefits of NoisyMix on a range of benchmark datasets, including ImageNet-C, ImageNet-R, and ImageNet-P.", "Moreover, we provide theory to understand implicit regularization and robustness of NoisyMix."]} {"id": "http://arxiv.org/abs/2202.01267", "title": "FedSpace: An Efficient Federated Learning Framework at Satellites and Ground Stations.", "authors": "Jinhyun So, Kevin Hsieh, Behnaz Arzani, Shadi Noghabi, Salman Avestimehr, Ranveer Chandra", "abstract": "Large-scale deployments of low Earth orbit (LEO) satellites collect massive amounts of Earth imagery and sensor data, which can empower machine learning (ML) to address global challenges such as real-time disaster navigation and mitigation. However, it is often infeasible to download all the high-resolution images and train these ML models on the ground because of limited downlink bandwidth, sparse connectivity, and regulatory constraints on the imagery resolution. To address these challenges, we leverage Federated Learning (FL), where ground stations and satellites collaboratively train a global ML model without sharing the captured images on the satellites. We show fundamental challenges in applying existing FL algorithms among satellites and ground stations, and we formulate an optimization problem which captures a unique trade-off between staleness and idleness. We propose a novel FL framework, named FedSpace, which dynamically schedules model aggregation based on the deterministic and time-varying connectivity according to satellite orbits.
Extensive numerical evaluations based on real-world satellite images and satellite networks show that FedSpace reduces the training time by 1.7 days (38.6%) over the state-of-the-art FL algorithms.", "sentences": ["FedSpace: An Efficient Federated Learning Framework at Satellites and Ground Stations.", "Large-scale deployments of low Earth orbit (LEO) satellites collect massive amounts of Earth imagery and sensor data, which can empower machine learning (ML) to address global challenges such as real-time disaster navigation and mitigation.", "However, it is often infeasible to download all the high-resolution images and train these ML models on the ground because of limited downlink bandwidth, sparse connectivity, and regulatory constraints on the imagery resolution.", "To address these challenges, we leverage Federated Learning (FL), where ground stations and satellites collaboratively train a global ML model without sharing the captured images on the satellites.", "We show fundamental challenges in applying existing FL algorithms among satellites and ground stations, and we formulate an optimization problem which captures a unique trade-off between staleness and idleness.", "We propose a novel FL framework, named FedSpace, which dynamically schedules model aggregation based on the deterministic and time-varying connectivity according to satellite orbits.", "Extensive numerical evaluations based on real-world satellite images and satellite networks show that FedSpace reduces the training time by 1.7 days (38.6%) over the state-of-the-art FL algorithms."]} {"id": "http://arxiv.org/abs/2202.01268", "title": "DASHA: Distributed Nonconvex Optimization with Communication Compression, Optimal Oracle Complexity, and No Client Synchronization.", "authors": "Alexander Tyurin, Peter Richt\u00e1rik", "abstract": "We develop and analyze DASHA: a new family of methods for nonconvex distributed optimization problems. When the local functions at the nodes have a finite-sum or an expectation form, our new methods, DASHA-PAGE and DASHA-SYNC-MVR, improve the theoretical oracle and communication complexity of the previous state-of-the-art method MARINA by Gorbunov et al. (2020). In particular, to achieve an epsilon-stationary point, and considering the random sparsifier RandK as an example, our methods compute the optimal number of gradients $\\mathcal{O}\\left(\\frac{\\sqrt{m}}{\\varepsilon\\sqrt{n}}\\right)$ and $\\mathcal{O}\\left(\\frac{\\sigma}{\\varepsilon^{3/2}n}\\right)$ in finite-sum and expectation form cases, respectively, while maintaining the SOTA communication complexity $\\mathcal{O}\\left(\\frac{d}{\\varepsilon \\sqrt{n}}\\right)$. Furthermore, unlike MARINA, the new methods DASHA, DASHA-PAGE and DASHA-MVR send compressed vectors only and never synchronize the nodes, which makes them more practical for federated learning. We extend our results to the case when the functions satisfy the Polyak-Lojasiewicz condition.
Finally, our theory is corroborated in practice: we see a significant improvement in experiments with nonconvex classification and training of deep learning models.", "sentences": ["DASHA: Distributed Nonconvex Optimization with Communication Compression, Optimal Oracle Complexity, and No Client Synchronization.", "We develop and analyze DASHA: a new family of methods for nonconvex distributed optimization problems.", "When the local functions at the nodes have a finite-sum or an expectation form, our new methods, DASHA-PAGE and DASHA-SYNC-MVR, improve the theoretical oracle and communication complexity of the previous state-of-the-art method MARINA by Gorbunov et al. (2020).", "In particular, to achieve an epsilon-stationary point, and considering the random sparsifier RandK as an example, our methods compute the optimal number of gradients $\\mathcal{O}\\left(\\frac{\\sqrt{m}}{\\varepsilon\\sqrt{n}}\\right)$ and $\\mathcal{O}\\left(\\frac{\\sigma}{\\varepsilon^{3/2}n}\\right)$ in finite-sum and expectation form cases, respectively, while maintaining the SOTA communication complexity $\\mathcal{O}\\left(\\frac{d}{\\varepsilon \\sqrt{n}}\\right)$.", "Furthermore, unlike MARINA, the new methods DASHA, DASHA-PAGE and DASHA-MVR send compressed vectors only and never synchronize the nodes, which makes them more practical for federated learning.", "We extend our results to the case when the functions satisfy the Polyak-Lojasiewicz condition.", "Finally, our theory is corroborated in practice: we see a significant improvement in experiments with nonconvex classification and training of deep learning models."]} {"id": "http://arxiv.org/abs/2202.01269", "title": "Robust Estimation for Nonparametric Families via Generative Adversarial Networks.", "authors": "Banghua Zhu, Jiantao Jiao, Michael I. Jordan", "abstract": "We provide a general framework for designing Generative Adversarial Networks (GANs) to solve high dimensional robust statistics problems, which aim at estimating an unknown parameter of the true distribution given adversarially corrupted samples. Prior work focuses on the problem of robust mean and covariance estimation when the true distribution lies in the family of Gaussian distributions or elliptical distributions, and analyzes depth or scoring rule based GAN losses for the problem. Our work extends these to robust mean estimation, second moment estimation, and robust linear regression when the true distribution only has bounded Orlicz norms, which includes the broad family of sub-Gaussian, sub-Exponential and bounded moment distributions. We also provide a different set of sufficient conditions for the GAN loss to work: we only require its induced distance function to be a cumulative density function of some light-tailed distribution, which is easily satisfied by neural networks with sigmoid activation.
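The DASHA abstract above uses the RandK sparsifier as its running example; its textbook definition is short enough to state exactly (only the helper name below is ours):

import numpy as np

def rand_k(x, k, rng):
    # Keep k uniformly chosen coordinates of x and scale them by d/k,
    # which makes the compressor unbiased: E[rand_k(x)] = x.
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out

x_hat = rand_k(np.arange(10.0), 3, np.random.default_rng(0))

Each node would send only the k surviving coordinates, which is the source of the communication savings the abstract quantifies.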
In terms of techniques, our proposed GAN losses can be viewed as a smoothed and generalized Kolmogorov-Smirnov distance, which overcomes the computational intractability of the original Kolmogorov-Smirnov distance used in the prior work.", "sentences": ["Robust Estimation for Nonparametric Families via Generative Adversarial Networks.", "We provide a general framework for designing Generative Adversarial Networks (GANs) to solve high dimensional robust statistics problems, which aim at estimating an unknown parameter of the true distribution given adversarially corrupted samples.", "Prior work focuses on the problem of robust mean and covariance estimation when the true distribution lies in the family of Gaussian distributions or elliptical distributions, and analyzes depth or scoring rule based GAN losses for the problem.", "Our work extends these to robust mean estimation, second moment estimation, and robust linear regression when the true distribution only has bounded Orlicz norms, which includes the broad family of sub-Gaussian, sub-Exponential and bounded moment distributions.", "We also provide a different set of sufficient conditions for the GAN loss to work: we only require its induced distance function to be a cumulative density function of some light-tailed distribution, which is easily satisfied by neural networks with sigmoid activation.", "In terms of techniques, our proposed GAN losses can be viewed as a smoothed and generalized Kolmogorov-Smirnov distance, which overcomes the computational intractability of the original Kolmogorov-Smirnov distance used in the prior work."]} {"id": "http://arxiv.org/abs/2202.01273", "title": "Beyond Images: Label Noise Transition Matrix Estimation for Tasks with Lower-Quality Features.", "authors": "Zhaowei Zhu, Jialu Wang, Yang Liu", "abstract": "The label noise transition matrix, denoting the transition probabilities from clean labels to noisy labels, is crucial knowledge for designing statistically robust solutions. Existing estimators for noise transition matrices, e.g., using either anchor points or clusterability, focus on computer vision tasks, where it is relatively easy to obtain high-quality representations. However, for other tasks with lower-quality features, the uninformative variables may obscure the useful counterpart and make anchor-point or clusterability conditions hard to satisfy. We empirically observe the failures of these approaches on a number of commonly used datasets. In this paper, to handle this issue, we propose a generally practical information-theoretic approach to down-weight the less informative parts of the lower-quality features. The salient technical challenge is to compute the relevant information-theoretical metrics using only noisy labels instead of clean ones. We prove that the celebrated $f$-mutual information measure can often preserve the order when calculated using noisy labels. The necessity and effectiveness of the proposed method is also demonstrated by evaluating the estimation error on a varied set of tabular data and text classification tasks with lower-quality features.
Code is available at github.com/UCSC-REAL/Est-T-MI.", "sentences": ["Beyond Images: Label Noise Transition Matrix Estimation for Tasks with Lower-Quality Features.", "The label noise transition matrix, denoting the transition probabilities from clean labels to noisy labels, is crucial knowledge for designing statistically robust solutions.", "Existing estimators for noise transition matrices, e.g., using either anchor points or clusterability, focus on computer vision tasks, where it is relatively easy to obtain high-quality representations.", "However, for other tasks with lower-quality features, the uninformative variables may obscure the useful counterpart and make anchor-point or clusterability conditions hard to satisfy.", "We empirically observe the failures of these approaches on a number of commonly used datasets.", "In this paper, to handle this issue, we propose a generally practical information-theoretic approach to down-weight the less informative parts of the lower-quality features.", "The salient technical challenge is to compute the relevant information-theoretical metrics using only noisy labels instead of clean ones.", "We prove that the celebrated $f$-mutual information measure can often preserve the order when calculated using noisy labels.", "The necessity and effectiveness of the proposed method is also demonstrated by evaluating the estimation error on a varied set of tabular data and text classification tasks with lower-quality features.", "Code is available at github.com/UCSC-REAL/Est-T-MI."]} {"id": "http://arxiv.org/abs/2202.01275", "title": "Topological Classification in a Wasserstein Distance Based Vector Space.", "authors": "Tananun Songdechakraiwut, Bryan M. Krause, Matthew I. Banks, Kirill V. Nourski, Barry D. Van Veen", "abstract": "Classification of large and dense networks based on topology is very difficult due to the computational challenges of extracting meaningful topological features from real-world networks. In this paper we present a computationally tractable approach to topological classification of networks by using principled theory from persistent homology and optimal transport to define a novel vector representation for topological features. The proposed vector space is based on the Wasserstein distance between persistence barcodes. The 1-skeleton of the network graph is employed to obtain 1-dimensional persistence barcodes that represent connected components and cycles. These barcodes and the corresponding Wasserstein distance can be computed very efficiently.
The effectiveness of the proposed vector space is demonstrated using support vector machines to classify simulated networks and measured functional brain networks.", "sentences": ["Topological Classification in a Wasserstein Distance Based Vector Space.", "Classification of large and dense networks based on topology is very difficult due to the computational challenges of extracting meaningful topological features from real-world networks.", "In this paper we present a computationally tractable approach to topological classification of networks by using principled theory from persistent homology and optimal transport to define a novel vector representation for topological features.", "The proposed vector space is based on the Wasserstein distance between persistence barcodes.", "The 1-skeleton of the network graph is employed to obtain 1-dimensional persistence barcodes that represent connected components and cycles.", "These barcodes and the corresponding Wasserstein distance can be computed very efficiently.", "The effectiveness of the proposed vector space is demonstrated using support vector machines to classify simulated networks and measured functional brain networks."]} {"id": "http://arxiv.org/abs/2202.01277", "title": "Global Optimization Networks.", "authors": "Sen Zhao, Erez Louidor, Olexander Mangylov, Maya Gupta", "abstract": "We consider the problem of estimating a good maximizer of a black-box function given noisy examples. To solve such problems, we propose to fit a new type of function which we call a global optimization network (GON), defined as any composition of an invertible function and a unimodal function, whose unique global maximizer can be inferred in $\\mathcal{O}(D)$ time. In this paper, we show how to construct invertible and unimodal functions by using linear inequality constraints on lattice models. We also extend to \\emph{conditional} GONs that find a global maximizer conditioned on specified inputs of other dimensions. Experiments show the GON maximizers are statistically significantly better predictions than those produced by convex fits, GPR, or DNNs, and are more reasonable predictions for real-world problems.", "sentences": ["Global Optimization Networks.", "We consider the problem of estimating a good maximizer of a black-box function given noisy examples.", "To solve such problems, we propose to fit a new type of function which we call a global optimization network (GON), defined as any composition of an invertible function and a unimodal function, whose unique global maximizer can be inferred in $\\mathcal{O}(D)$ time.", "In this paper, we show how to construct invertible and unimodal functions by using linear inequality constraints on lattice models.", "We also extend to \\emph{conditional} GONs that find a global maximizer conditioned on specified inputs of other dimensions.", "Experiments show the GON maximizers are statistically significantly better predictions than those produced by convex fits, GPR, or DNNs, and are more reasonable predictions for real-world problems."]} {"id": "http://arxiv.org/abs/2202.01279", "title": "PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts.", "authors": "Stephen H. Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-David, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Alan Fries, Maged S. 
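For the topological-classification abstract above, the Wasserstein distance between persistence barcodes admits a compact (if unoptimized) illustration. The diagonal-augmentation trick below is the generic construction, not the paper's specialized efficient computation:

import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein_barcode(bar1, bar2):
    # bar1, bar2: (n, 2) arrays of (birth, death) pairs. Bars may be
    # matched to each other or to their projection onto the diagonal.
    def diag(b):
        m = (b[:, 0] + b[:, 1]) / 2.0
        return np.stack([m, m], axis=1)
    a = np.vstack([bar1, diag(bar2)])
    b = np.vstack([bar2, diag(bar1)])
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    cost[len(bar1):, len(bar2):] = 0.0  # diagonal-to-diagonal is free
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()

A classifier can then treat these pairwise distances as coordinates, roughly in the spirit of the vector representation the abstract describes.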
Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Mike Tian-Jian Jiang, Alexander M. Rush", "abstract": "PromptSource is a system for creating, sharing, and using natural language prompts. Prompts are functions that map an example from a dataset to a natural language input and target output. Using prompts to train and query language models is an emerging area in NLP that requires new tools that let users develop and refine these prompts collaboratively. PromptSource addresses the emergent challenges in this new setting with (1) a templating language for defining data-linked prompts, (2) an interface that lets users quickly iterate on prompt development by observing outputs of their prompts on many examples, and (3) a community-driven set of guidelines for contributing new prompts to a common pool. Over 2,000 prompts for roughly 170 datasets are already available in PromptSource. PromptSource is available at https://github.com/bigscience-workshop/promptsource.", "sentences": ["PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts.", "PromptSource is a system for creating, sharing, and using natural language prompts.", "Prompts are functions that map an example from a dataset to a natural language input and target output.", "Using prompts to train and query language models is an emerging area in NLP that requires new tools that let users develop and refine these prompts collaboratively.", "PromptSource addresses the emergent challenges in this new setting with (1) a templating language for defining data-linked prompts, (2) an interface that lets users quickly iterate on prompt development by observing outputs of their prompts on many examples, and (3) a community-driven set of guidelines for contributing new prompts to a common pool.", "Over 2,000 prompts for roughly 170 datasets are already available in PromptSource.", "PromptSource is available at https://github.com/bigscience-workshop/promptsource."]} {"id": "http://arxiv.org/abs/2202.01287", "title": "Fenrir: Physics-Enhanced Regression for Initial Value Problems.", "authors": "Filip Tronarp, Nathanael Bosch, Philipp Hennig", "abstract": "We show how probabilistic numerics can be used to convert an initial value problem into a Gauss--Markov process parametrised by the dynamics of the initial value problem. Consequently, the often difficult problem of parameter estimation in ordinary differential equations is reduced to hyperparameter estimation in Gauss--Markov regression, which tends to be considerably easier. The method's relation and benefits in comparison to classical numerical integration and gradient matching approaches are elucidated. In particular, the method can, in contrast to gradient matching, handle partial observations, and has certain routes for escaping local optima not available to classical numerical integration.
Experimental results demonstrate that the method is on par or moderately better than competing approaches.", "sentences": ["Fenrir: Physics-Enhanced Regression for Initial Value Problems.", "We show how probabilistic numerics can be used to convert an initial value problem into a Gauss--Markov process parametrised by the dynamics of the initial value problem.", "Consequently, the often difficult problem of parameter estimation in ordinary differential equations is reduced to hyperparameter estimation in Gauss--Markov regression, which tends to be considerably easier.", "The method's relation and benefits in comparison to classical numerical integration and gradient matching approaches are elucidated.", "In particular, the method can, in contrast to gradient matching, handle partial observations, and has certain routes for escaping local optima not available to classical numerical integration.", "Experimental results demonstrate that the method is on par or moderately better than competing approaches."]} {"id": "http://arxiv.org/abs/2202.01288", "title": "Imitation Learning by Estimating Expertise of Demonstrators.", "authors": "Mark Beliaev, Andy Shih, Stefano Ermon, Dorsa Sadigh, Ramtin Pedarsani", "abstract": "Many existing imitation learning datasets are collected from multiple demonstrators, each with different expertise at different parts of the environment. Yet, standard imitation learning algorithms typically treat all demonstrators as homogeneous, regardless of their expertise, absorbing the weaknesses of any suboptimal demonstrators. In this work, we show that unsupervised learning over demonstrator expertise can lead to a consistent boost in the performance of imitation learning algorithms. We develop and optimize a joint model over a learned policy and expertise levels of the demonstrators. This enables our model to learn from the optimal behavior and filter out the suboptimal behavior of each demonstrator. Our model learns a single policy that can outperform even the best demonstrator, and can be used to estimate the expertise of any demonstrator at any state.
We illustrate our findings on real-robotic continuous control tasks from Robomimic and discrete environments such as MiniGrid and chess, out-performing competing methods in $21$ out of $23$ settings, with an average of $7\\%$ and up to $60\\%$ improvement in terms of the final reward.", "sentences": ["Imitation Learning by Estimating Expertise of Demonstrators.", "Many existing imitation learning datasets are collected from multiple demonstrators, each with different expertise at different parts of the environment.", "Yet, standard imitation learning algorithms typically treat all demonstrators as homogeneous, regardless of their expertise, absorbing the weaknesses of any suboptimal demonstrators.", "In this work, we show that unsupervised learning over demonstrator expertise can lead to a consistent boost in the performance of imitation learning algorithms.", "We develop and optimize a joint model over a learned policy and expertise levels of the demonstrators.", "This enables our model to learn from the optimal behavior and filter out the suboptimal behavior of each demonstrator.", "Our model learns a single policy that can outperform even the best demonstrator, and can be used to estimate the expertise of any demonstrator at any state.", "We illustrate our findings on real-robotic continuous control tasks from Robomimic and discrete environments such as MiniGrid and chess, out-performing competing methods in $21$ out of $23$ settings, with an average of $7\\%$ and up to $60\\%$ improvement in terms of the final reward."]} {"id": "http://arxiv.org/abs/2202.01290", "title": "Cyclical Pruning for Sparse Neural Networks.", "authors": "Suraj Srinivas, Andrey Kuzmin, Markus Nagel, Mart van Baalen, Andrii Skliar, Tijmen Blankevoort", "abstract": "Current methods for pruning neural network weights iteratively apply magnitude-based pruning on the model weights and re-train the resulting model to recover lost accuracy. In this work, we show that such strategies do not allow for the recovery of erroneously pruned weights. To enable weight recovery, we propose a simple strategy called \\textit{cyclical pruning} which requires the pruning schedule to be periodic and allows for weights pruned erroneously in one cycle to recover in subsequent ones. Experimental results on both linear models and large-scale deep neural networks show that cyclical pruning outperforms existing pruning algorithms, especially at high sparsity ratios. 
Our approach is easy to tune and can be readily incorporated into existing pruning pipelines to boost performance.", "sentences": ["Cyclical Pruning for Sparse Neural Networks.", "Current methods for pruning neural network weights iteratively apply magnitude-based pruning on the model weights and re-train the resulting model to recover lost accuracy.", "In this work, we show that such strategies do not allow for the recovery of erroneously pruned weights.", "To enable weight recovery, we propose a simple strategy called \\textit{cyclical pruning} which requires the pruning schedule to be periodic and allows for weights pruned erroneously in one cycle to recover in subsequent ones.", "Experimental results on both linear models and large-scale deep neural networks show that cyclical pruning outperforms existing pruning algorithms, especially at high sparsity ratios.", "Our approach is easy to tune and can be readily incorporated into existing pruning pipelines to boost performance."]} {"id": "http://arxiv.org/abs/2202.01292", "title": "Improved Regret for Differentially Private Exploration in Linear MDP.", "authors": "Dung Daniel Ngo, Giuseppe Vietri, Zhiwei Steven Wu", "abstract": "We study privacy-preserving exploration in sequential decision-making for environments that rely on sensitive data such as medical records. In particular, we focus on solving the problem of reinforcement learning (RL) subject to the constraint of (joint) differential privacy in the linear MDP setting, where both dynamics and rewards are given by linear functions. Prior work on this problem due to Luyo et al. (2021) achieves a regret rate that has a dependence of $O(K^{3/5})$ on the number of episodes $K$. We provide a private algorithm with an improved regret rate with an optimal dependence of $O(\\sqrt{K})$ on the number of episodes. The key recipe for our stronger regret guarantee is the adaptivity in the policy update schedule, in which an update only occurs when sufficient changes in the data are detected. As a result, our algorithm benefits from low switching cost and only performs $O(\\log(K))$ updates, which greatly reduces the amount of privacy noise. 
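A minimal PyTorch sketch of the cyclical idea in the pruning abstract above (the exact schedule and mask policy in the paper may differ; the helper names are ours): sparsity ramps up within each cycle and resets at cycle boundaries, and masks are recomputed from current magnitudes, so a weight pruned erroneously in one cycle can grow back in the next.

import torch

def magnitude_mask(weight, sparsity):
    # Zero out the smallest-magnitude `sparsity` fraction of weights.
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

def cyclical_sparsity(step, period, final_sparsity):
    # Sparsity ramps from 0 to final_sparsity within each cycle, then resets.
    return final_sparsity * ((step % period) / period)

def prune_step(model, step, period, final_sparsity):
    s = cyclical_sparsity(step, period, final_sparsity)
    for p in model.parameters():
        if p.dim() > 1:  # prune weight matrices, leave biases dense
            p.data.mul_(magnitude_mask(p.data, s))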
Finally, in the most prevalent privacy regimes where the privacy parameter $\\epsilon$ is a constant, our algorithm incurs negligible privacy cost -- in comparison with the existing non-private regret bounds, the additional regret due to privacy appears in lower-order terms.", "sentences": ["Improved Regret for Differentially Private Exploration in Linear MDP.", "We study privacy-preserving exploration in sequential decision-making for environments that rely on sensitive data such as medical records.", "In particular, we focus on solving the problem of reinforcement learning (RL) subject to the constraint of (joint) differential privacy in the linear MDP setting, where both dynamics and rewards are given by linear functions.", "Prior work on this problem due to Luyo et al.", "(2021) achieves a regret rate that has a dependence of $O(K^{3/5})$ on the number of episodes $K$.", "We provide a private algorithm with an improved regret rate with an optimal dependence of $O(\\sqrt{K})$ on the number of episodes.", "The key recipe for our stronger regret guarantee is the adaptivity in the policy update schedule, in which an update only occurs when sufficient changes in the data are detected.", "As a result, our algorithm benefits from low switching cost and only performs $O(\\log(K))$ updates, which greatly reduces the amount of privacy noise.", "Finally, in the most prevalent privacy regimes where the privacy parameter $\\epsilon$ is a constant, our algorithm incurs negligible privacy cost -- in comparison with the existing non-private regret bounds, the additional regret due to privacy appears in lower-order terms."]} {"id": "http://arxiv.org/abs/2202.01300", "title": "Causal Inference Through the Structural Causal Marginal Problem.", "authors": "Luigi Gresele, Julius von K\u00fcgelgen, Jonas M. K\u00fcbler, Elke Kirschbaum, Bernhard Sch\u00f6lkopf, Dominik Janzing", "abstract": "We introduce an approach to counterfactual inference based on merging information from multiple datasets. We consider a causal reformulation of the statistical marginal problem: given a collection of marginal structural causal models (SCMs) over distinct but overlapping sets of variables, determine the set of joint SCMs that are counterfactually consistent with the marginal ones. We formalise this approach for categorical SCMs using the response function formulation and show that it reduces the space of allowed marginal and joint SCMs. Our work thus highlights a new mode of falsifiability through additional variables, in contrast to the statistical one via additional data.", "sentences": ["Causal Inference Through the Structural Causal Marginal Problem.", "We introduce an approach to counterfactual inference based on merging information from multiple datasets.", "We consider a causal reformulation of the statistical marginal problem: given a collection of marginal structural causal models (SCMs) over distinct but overlapping sets of variables, determine the set of joint SCMs that are counterfactually consistent with the marginal ones.", "We formalise this approach for categorical SCMs using the response function formulation and show that it reduces the space of allowed marginal and joint SCMs.", "Our work thus highlights a new mode of falsifiability through additional variables, in contrast to the statistical one via additional data."]} {"id": "http://arxiv.org/abs/2202.01302", "title": "A Comparison of Online Hate on Reddit and 4chan: A Case Study of the 2020 US Election.", "authors": "Fatima Zahrah, Jason R. C. 
Nurse, Michael Goldsmith", "abstract": "The rapid integration of the Internet into our daily lives has led to many benefits but also to a number of new, wide-spread threats such as online hate, trolling, bullying, and generally aggressive behaviours. While research has traditionally explored online hate, in particular, on one platform, the reality is that such hate is a phenomenon that often makes use of multiple online networks. In this article, we seek to advance the discussion into online hate by harnessing a comparative approach, where we make use of various Natural Language Processing (NLP) techniques to computationally analyse hateful content from Reddit and 4chan relating to the 2020 US Presidential Elections. Our findings show how content and posting activity can differ depending on the platform being used. Through this, we provide initial comparison into the platform-specific behaviours of online hate, and how different platforms can serve specific purposes. We further provide several avenues for future research utilising a cross-platform approach so as to gain a more comprehensive understanding of the global hate ecosystem.", "sentences": ["A Comparison of Online Hate on Reddit and 4chan: A Case Study of the 2020 US Election.", "The rapid integration of the Internet into our daily lives has led to many benefits but also to a number of new, wide-spread threats such as online hate, trolling, bullying, and generally aggressive behaviours.", "While research has traditionally explored online hate, in particular, on one platform, the reality is that such hate is a phenomenon that often makes use of multiple online networks.", "In this article, we seek to advance the discussion into online hate by harnessing a comparative approach, where we make use of various Natural Language Processing (NLP) techniques to computationally analyse hateful content from Reddit and 4chan relating to the 2020 US Presidential Elections.", "Our findings show how content and posting activity can differ depending on the platform being used.", "Through this, we provide initial comparison into the platform-specific behaviours of online hate, and how different platforms can serve specific purposes.", "We further provide several avenues for future research utilising a cross-platform approach so as to gain a more comprehensive understanding of the global hate ecosystem."]} {"id": "http://arxiv.org/abs/2202.01306", "title": "Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers.", "authors": "Youjie Li, Amar Phanishayee, Derek Murray, Jakub Tarnawski, Nam Sung Kim", "abstract": "Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade, leaving only those who have access to massive datacenter-based resources with the ability to develop and train such models. One of the main challenges for the long tail of researchers who might have access to only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size. The problem is so acute that the memory requirement of training large DNN models can often exceed the aggregate capacity of all available GPUs on commodity servers; this problem only gets worse with the trend of ever-growing model sizes. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. 
In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data to push the boundaries of training large models efficiently on modest multi-GPU deployments. Across many large DNN models, Harmony is able to reduce swap load by up to two orders of magnitude and obtain a training throughput speedup of up to 7.6x over highly optimized baselines with virtualized memory.", "sentences": ["Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers.", "Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade, leaving only those who have access to massive datacenter-based resources with the ability to develop and train such models.", "One of the main challenges for the long tail of researchers who might have access to only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size.", "The problem is so acute that the memory requirement of training large DNN models can often exceed the aggregate capacity of all available GPUs on commodity servers; this problem only gets worse with the trend of ever-growing model sizes.", "Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead.", "In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data to push the boundaries of training large models efficiently on modest multi-GPU deployments.", "Across many large DNN models, Harmony is able to reduce swap load by up to two orders of magnitude and obtain a training throughput speedup of up to 7.6x over highly optimized baselines with virtualized memory."]} {"id": "http://arxiv.org/abs/2202.01308", "title": "Impact Analysis of Harassment Against Women Using Association Rule Mining Approaches: Bangladesh Prospective.", "authors": "Bahar Uddin Mahmud, Afsana Sharmin", "abstract": "In recent years, it has been noticed that women are making progress in every sector of society. Their involvement in every field, such as education, job market, social work, etc., is increasing at a remarkable rate. For the last several years, the government has been trying its level best for the advancement of women in every sector by doing several research work and activities and funding several organizations to motivate women. Although women's involvement in several fields is increasing, the big concern is they are facing several barriers in their advancement, and it is not surprising that sexual harassment is one of them. In Bangladesh, harassment against women, especially students, is a common phenomenon, and it is increasing. In this paper, a survey-based approach and the Apriori algorithm are used to analyze the several impacts of harassment among several age groups. Also, several factors such as frequent impacts of harassment, most vulnerable groups, women mostly facing harassment, the alleged person behind harassment, etc., are analyzed through association rule mining with the Apriori and F.P. Growth algorithms. Then, a comparison of the performance of both algorithms is shown briefly.
For this analysis, data have been carefully collected from all age groups.", "sentences": ["Impact Analysis of Harassment Against Women Using Association Rule Mining Approaches: Bangladesh Prospective.", "In recent years, it has been noticed that women are making progress in every sector of society.", "Their involvement in every field, such as education, the job market, and social work, is increasing at a remarkable rate.", "For the last several years, the government has been trying its best to advance women in every sector by conducting research and related activities and by funding several organizations to motivate women.", "Although women's involvement in several fields is increasing, the big concern is that they face several barriers to their advancement, and it is not surprising that sexual harassment is one of them.", "In Bangladesh, harassment against women, especially students, is a common phenomenon, and it is increasing.", "In this paper, a survey-based approach and the Apriori algorithm are used to analyze the impacts of harassment across several age groups.", "Also, several factors, such as the frequent impacts of harassment, the most vulnerable groups, the women most often facing harassment, and the alleged persons behind harassment, are analyzed through association rule mining with the Apriori and FP-Growth algorithms.", "A brief performance comparison of the two algorithms is then presented.", "For this analysis, data have been carefully collected from all age groups."]} {"id": "http://arxiv.org/abs/2202.01309", "title": "Multi-Resolution Factor Graph Based Stereo Correspondence Algorithm.", "authors": "Hanieh Shabanian, Madhusudhanan Balasubramanian", "abstract": "A dense depth-map of a scene at an arbitrary view orientation can be estimated from dense view correspondences among multiple lower-dimensional views of the scene. These low-dimensional view correspondences are dependent on the geometrical relationship among the views and the scene. Determining dense view correspondences is difficult in part due to the presence of homogeneous regions in the scene and due to the presence of occluded regions and illumination differences among the views. We present a new multi-resolution factor graph-based stereo matching algorithm (MR-FGS) that utilizes both intra- and inter-resolution dependencies among the views as well as among the disparity estimates. The proposed framework allows the exchange of information among multiple resolutions of the correspondence problem and is useful for handling larger homogeneous regions in a scene. The MR-FGS algorithm was evaluated qualitatively and quantitatively using stereo pairs in the Middlebury stereo benchmark dataset based on commonly used performance measures. When compared to a recently developed factor graph model (FGS), the MR-FGS algorithm provided more accurate disparity estimates without requiring the commonly used post-processing procedure known as the left-right consistency check. 
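The record above (arXiv:2202.01308) mines harassment-survey responses with the Apriori and FP-Growth algorithms. A minimal sketch of that style of analysis using the apriori/fpgrowth implementations from the mlxtend library; the survey columns, the data, and the thresholds below are invented for illustration and are not taken from the paper.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth

# Hypothetical one-hot survey matrix: each row is a respondent, each column a
# reported attribute (age group, impact, setting of the harassment, ...).
survey = pd.DataFrame(
    {
        "age_18_25": [1, 1, 0, 1, 1, 0],
        "student": [1, 1, 0, 1, 0, 0],
        "impact_anxiety": [1, 1, 0, 1, 1, 0],
        "impact_dropout_risk": [0, 1, 0, 1, 0, 0],
        "public_transport": [1, 0, 0, 1, 1, 0],
    }
).astype(bool)

# Frequent itemsets via Apriori and FP-Growth; both return the same itemsets
# and differ mainly in runtime, which is what the paper's comparison measures.
freq_ap = apriori(survey, min_support=0.4, use_colnames=True)
freq_fp = fpgrowth(survey, min_support=0.4, use_colnames=True)
print(freq_ap.sort_values("support", ascending=False).head())

# Confidence of one candidate rule, {age_18_25, student} -> {impact_anxiety},
# computed directly as support(A and B) / support(A).
antecedent = survey["age_18_25"] & survey["student"]
both = antecedent & survey["impact_anxiety"]
print("confidence:", both.mean() / antecedent.mean())
```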
The multi-resolution dependency constraint within the factor-graph model significantly improved contrast along depth boundaries in the MR-FGS-generated disparity maps.", "sentences": ["Multi-Resolution Factor Graph Based Stereo Correspondence Algorithm.", "A dense depth-map of a scene at an arbitrary view orientation can be estimated from dense view correspondences among multiple lower-dimensional views of the scene.", "These low-dimensional view correspondences are dependent on the geometrical relationship among the views and the scene.", "Determining dense view correspondences is difficult in part due to the presence of homogeneous regions in the scene and due to the presence of occluded regions and illumination differences among the views.", "We present a new multi-resolution factor graph-based stereo matching algorithm (MR-FGS) that utilizes both intra- and inter-resolution dependencies among the views as well as among the disparity estimates.", "The proposed framework allows the exchange of information among multiple resolutions of the correspondence problem and is useful for handling larger homogeneous regions in a scene.", "The MR-FGS algorithm was evaluated qualitatively and quantitatively using stereo pairs in the Middlebury stereo benchmark dataset based on commonly used performance measures.", "When compared to a recently developed factor graph model (FGS), the MR-FGS algorithm provided more accurate disparity estimates without requiring the commonly used post-processing procedure known as the left-right consistency check.", "The multi-resolution dependency constraint within the factor-graph model significantly improved contrast along depth boundaries in the MR-FGS-generated disparity maps."]} {"id": "http://arxiv.org/abs/2202.01312", "title": "Causal Imitation Learning under Temporally Correlated Noise.", "authors": "Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, Zhiwei Steven Wu", "abstract": "We develop algorithms for imitation learning from policy data that was corrupted by temporally correlated noise in expert actions. When noise affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch on to, leading to poor policy performance. To break up these spurious correlations, we apply modern variants of the instrumental variable regression (IVR) technique of econometrics, enabling us to recover the underlying policy without requiring access to an interactive expert. In particular, we present two techniques, one of a generative-modeling flavor (DoubIL) that can utilize access to a simulator, and one of a game-theoretic flavor (ResiduIL) that can be run entirely offline. 
We find both of our algorithms compare favorably to behavioral cloning on simulated control tasks.", "sentences": ["Causal Imitation Learning under Temporally Correlated Noise.", "We develop algorithms for imitation learning from policy data that was corrupted by temporally correlated noise in expert actions.", "When noise affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch on to, leading to poor policy performance.", "To break up these spurious correlations, we apply modern variants of the instrumental variable regression (IVR) technique of econometrics, enabling us to recover the underlying policy without requiring access to an interactive expert.", "In particular, we present two techniques, one of a generative-modeling flavor (DoubIL) that can utilize access to a simulator, and one of a game-theoretic flavor (ResiduIL) that can be run entirely offline.", "We find both of our algorithms compare favorably to behavioral cloning on simulated control tasks."]} {"id": "http://arxiv.org/abs/2202.01314", "title": "Gradient estimators for normalising flows.", "authors": "Piotr Bialas, Piotr Korcyl, Tomasz Stebel", "abstract": "Recently a machine learning approach to Monte-Carlo simulations called Neural Markov Chain Monte-Carlo (NMCMC) is gaining traction. In its most popular form it uses neural networks to construct normalizing flows which are then trained to approximate the desired target distribution. As this distribution is usually defined via a Hamiltonian or action, the standard learning algorithm requires estimation of the action gradient with respect to the fields. In this contribution we present another gradient estimator (and the corresponding PyTorch implementation) that avoids this calculation, thus potentially speeding up training for models with more complicated actions. We also study the statistical properties of several gradient estimators and show that our formulation leads to better training results.", "sentences": ["Gradient estimators for normalising flows.", "Recently a machine learning approach to Monte-Carlo simulations called Neural Markov Chain Monte-Carlo (NMCMC) is gaining traction.", "In its most popular form it uses neural networks to construct normalizing flows which are then trained to approximate the desired target distribution.", "As this distribution is usually defined via a Hamiltonian or action, the standard learning algorithm requires estimation of the action gradient with respect to the fields.", "In this contribution we present another gradient estimator (and the corresponding PyTorch implementation) that avoids this calculation, thus potentially speeding up training for models with more complicated actions.", "We also study the statistical properties of several gradient estimators and show that our formulation leads to better training results."]} {"id": "http://arxiv.org/abs/2202.01315", "title": "Approximating Full Conformal Prediction at Scale via Influence Functions.", "authors": "Javier Abad, Umang Bhatt, Adrian Weller, Giovanni Cherubin", "abstract": "Conformal prediction (CP) is a wrapper around traditional machine learning models, giving coverage guarantees under the sole assumption of exchangeability; in classification problems, for a chosen significance level $\\varepsilon$, CP guarantees that the number of errors is at most $\\varepsilon$, irrespective of whether the underlying model is misspecified. 
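The causal imitation learning record above (arXiv:2202.01312) rests on instrumental variable regression. A toy sketch of the underlying two-stage least squares idea on synthetic data: a confounder u corrupts both the state and the expert action, and an instrument z (for example, an earlier state) recovers the true state-to-action coefficient. All quantities are invented; DoubIL and ResiduIL themselves are considerably more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

z = rng.normal(size=n)                        # instrument: affects a only via x
u = rng.normal(size=n)                        # temporally correlated noise (confounder)
x = 0.8 * z + u + 0.1 * rng.normal(size=n)    # observed state
a = 1.5 * x + u + 0.1 * rng.normal(size=n)    # expert action, corrupted by u

# Naive regression of a on x latches onto the spurious correlation through u.
naive = np.polyfit(x, a, 1)[0]

# Two-stage least squares: project x onto the instrument, then regress a on
# the projected, confounder-free component of x.
x_hat = np.polyfit(z, x, 1)[0] * z
iv = (x_hat @ a) / (x_hat @ x)

print(f"true coefficient 1.5 | naive {naive:.2f} | 2SLS {iv:.2f}")
```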
However, the prohibitive computational costs of full CP led researchers to design scalable alternatives, which alas do not attain the same guarantees or statistical power as full CP. In this paper, we use influence functions to efficiently approximate full CP. We prove that our method is a consistent approximation of full CP, and empirically show that the approximation error becomes smaller as the training set increases; e.g., for $10^{3}$ training points the two methods output p-values that are $<10^{-3}$ apart: a negligible error for any practical application. Our methods enable scaling full CP to large real-world datasets. We compare our full CP approximation ACP to mainstream CP alternatives, and observe that our method is computationally competitive whilst enjoying the statistical predictive power of full CP.", "sentences": ["Approximating Full Conformal Prediction at Scale via Influence Functions.", "Conformal prediction (CP) is a wrapper around traditional machine learning models, giving coverage guarantees under the sole assumption of exchangeability; in classification problems, for a chosen significance level $\\varepsilon$, CP guarantees that the number of errors is at most $\\varepsilon$, irrespective of whether the underlying model is misspecified.", "However, the prohibitive computational costs of full CP led researchers to design scalable alternatives, which alas do not attain the same guarantees or statistical power as full CP.", "In this paper, we use influence functions to efficiently approximate full CP.", "We prove that our method is a consistent approximation of full CP, and empirically show that the approximation error becomes smaller as the training set increases; e.g., for $10^{3}$ training points the two methods output p-values that are $<10^{-3}$ apart: a negligible error for any practical application.", "Our methods enable scaling full CP to large real-world datasets.", "We compare our full CP approximation ACP to mainstream CP alternatives, and observe that our method is computationally competitive whilst enjoying the statistical predictive power of full CP."]} {"id": "http://arxiv.org/abs/2202.01319", "title": "Deep Learning for Epidemiologists: An Introduction to Neural Networks.", "authors": "Stylianos Serghiou, Kathryn Rough", "abstract": "Deep learning methods are increasingly being applied to problems in medicine and healthcare. However, few epidemiologists have received formal training in these methods. To bridge this gap, this article introduces the fundamentals of deep learning from an epidemiological perspective. Specifically, this article reviews core concepts in machine learning (overfitting, regularization, hyperparameters), explains several fundamental deep learning architectures (convolutional neural networks, recurrent neural networks), and summarizes training, evaluation, and deployment of models. 
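The full-CP record above (arXiv:2202.01315) approximates a procedure whose cost is easy to see in code: full conformal prediction refits the model once per candidate label of every test point. A small self-contained sketch with an invented nearest-centroid nonconformity score; the paper replaces the per-label refits with influence-function estimates.

```python
import numpy as np

def full_cp_pvalues(X, y, x_new, labels, nonconformity):
    """Full-CP p-value for each candidate label of x_new. `nonconformity`
    must score every row of the augmented dataset after 'training' on it,
    which is the expensive, repeated step that full CP requires."""
    pvals = {}
    for lab in labels:
        X_aug = np.vstack([X, x_new])
        y_aug = np.append(y, lab)
        scores = nonconformity(X_aug, y_aug)
        # Fraction of scores at least as extreme as the test point's own.
        pvals[lab] = float(np.mean(scores >= scores[-1]))
    return pvals

def nearest_centroid_scores(X_aug, y_aug):
    # Toy nonconformity: distance to the centroid of the hypothesized class.
    centers = {c: X_aug[y_aug == c].mean(axis=0) for c in np.unique(y_aug)}
    return np.array([np.linalg.norm(x - centers[c]) for x, c in zip(X_aug, y_aug)])

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(full_cp_pvalues(X, y, np.array([2.8, 3.1]), [0, 1], nearest_centroid_scores))
# The prediction set at significance eps is every label whose p-value > eps.
```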
We aim to enable the reader to engage with and critically evaluate medical applications of deep learning, facilitating a dialogue between computer scientists and epidemiologists that will improve the safety and efficacy of applications of this technology.", "sentences": ["Deep Learning for Epidemiologists: An Introduction to Neural Networks.", "Deep learning methods are increasingly being applied to problems in medicine and healthcare.", "However, few epidemiologists have received formal training in these methods.", "To bridge this gap, this article introduces the fundamentals of deep learning from an epidemiological perspective.", "Specifically, this article reviews core concepts in machine learning (overfitting, regularization, hyperparameters), explains several fundamental deep learning architectures (convolutional neural networks, recurrent neural networks), and summarizes training, evaluation, and deployment of models.", "We aim to enable the reader to engage with and critically evaluate medical applications of deep learning, facilitating a dialogue between computer scientists and epidemiologists that will improve the safety and efficacy of applications of this technology."]} {"id": "http://arxiv.org/abs/2202.01327", "title": "Adaptive Sampling Strategies to Construct Equitable Training Datasets.", "authors": "William Cai, Ro Encarnacion, Bobbie Chern, Sam Corbett-Davies, Miranda Bogen, Stevie Bergman, Sharad Goel", "abstract": "In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem. We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups. We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific learning rates and costs per sample. This flexible approach incorporates preferences of model-builders and other stakeholders, as well as the statistical properties of the learning task. When data collection decisions are made sequentially, we show that under certain conditions this optimization problem can be efficiently solved even without prior knowledge of the learning rates. To illustrate our approach, we conduct a simulation study of polygenic risk scores on synthetic genomic data -- an application domain that often suffers from non-representative data collection. 
We find that our adaptive sampling strategy outperforms several common data collection heuristics, including equal and proportional sampling, demonstrating the value of strategic dataset design for building equitable models.", "sentences": ["Adaptive Sampling Strategies to Construct Equitable Training Datasets.", "In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups.", "One factor contributing to these performance gaps is a lack of representation in the data the models are trained on.", "It is often unclear, however, how to operationalize representativeness in specific applications.", "Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem.", "We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups.", "We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific learning rates and costs per sample.", "This flexible approach incorporates preferences of model-builders and other stakeholders, as well as the statistical properties of the learning task.", "When data collection decisions are made sequentially, we show that under certain conditions this optimization problem can be efficiently solved even without prior knowledge of the learning rates.", "To illustrate our approach, we conduct a simulation study of polygenic risk scores on synthetic genomic data -- an application domain that often suffers from non-representative data collection.", "We find that our adaptive sampling strategy outperforms several common data collection heuristics, including equal and proportional sampling, demonstrating the value of strategic dataset design for building equitable models."]} {"id": "http://arxiv.org/abs/2202.01331", "title": "Fast Convex Optimization for Two-Layer ReLU Networks: Equivalent Model Classes and Cone Decompositions.", "authors": "Aaron Mishkin, Arda Sahiner, Mert Pilanci", "abstract": "We develop fast algorithms and robust software for convex optimization of two-layer neural networks with ReLU activation functions. Our work leverages a convex reformulation of the standard weight-decay penalized training problem as a set of group-$\\ell_1$-regularized data-local models, where locality is enforced by polyhedral cone constraints. In the special case of zero-regularization, we show that this problem is exactly equivalent to unconstrained optimization of a convex \"gated ReLU\" network. For problems with non-zero regularization, we show that convex gated ReLU models obtain data-dependent approximation bounds for the ReLU training problem. To optimize the convex reformulations, we develop an accelerated proximal gradient method and a practical augmented Lagrangian solver. We show that these approaches are faster than standard training heuristics for the non-convex problem, such as SGD, and outperform commercial interior-point solvers. 
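The adaptive-sampling record above (arXiv:2202.01327) allocates a collection budget using estimated group-specific learning rates and per-sample costs. A greedy sketch of the sequential version under an assumed inverse-power-law learning curve err_g(n) = a_g * n^(-b_g); all coefficients, costs, and the budget are invented, and the paper's actual objective and estimation procedure are richer than this minimax rule.

```python
# Assumed inverse-power-law learning curves: err_g(n) = a_g * n ** (-b_g).
curves = {"group_A": (1.0, 0.35), "group_B": (1.4, 0.25)}
cost = {"group_A": 1.0, "group_B": 2.5}     # collection cost per sample
n = {g: 50 for g in curves}                 # pilot samples already collected
budget = 500.0

def err(g):
    a, b = curves[g]
    return a * n[g] ** (-b)

# Greedy minimax allocation: keep buying one sample for whichever group
# currently has the worst estimated error, while the budget allows it.
while True:
    worst = max(curves, key=err)
    if cost[worst] > budget:
        break
    n[worst] += 1
    budget -= cost[worst]

print(n, {g: round(err(g), 3) for g in curves})
```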
Experimentally, we verify our theoretical results, explore the group-$\\ell_1$ regularization path, and scale convex optimization for neural networks to image classification on MNIST and CIFAR-10.", "sentences": ["Fast Convex Optimization for Two-Layer ReLU Networks: Equivalent Model Classes and Cone Decompositions.", "We develop fast algorithms and robust software for convex optimization of two-layer neural networks with ReLU activation functions.", "Our work leverages a convex reformulation of the standard weight-decay penalized training problem as a set of group-$\\ell_1$-regularized data-local models, where locality is enforced by polyhedral cone constraints.", "In the special case of zero-regularization, we show that this problem is exactly equivalent to unconstrained optimization of a convex \"gated ReLU\" network.", "For problems with non-zero regularization, we show that convex gated ReLU models obtain data-dependent approximation bounds for the ReLU training problem.", "To optimize the convex reformulations, we develop an accelerated proximal gradient method and a practical augmented Lagrangian solver.", "We show that these approaches are faster than standard training heuristics for the non-convex problem, such as SGD, and outperform commercial interior-point solvers.", "Experimentally, we verify our theoretical results, explore the group-$\\ell_1$ regularization path, and scale convex optimization for neural networks to image classification on MNIST and CIFAR-10."]} {"id": "http://arxiv.org/abs/2202.01332", "title": "Training a Bidirectional GAN-based One-Class Classifier for Network Intrusion Detection.", "authors": "Wen Xu, Julian Jang-Jaccard, Tong Liu, Fariza Sabrina", "abstract": "The network intrusion detection task is challenging because of the imbalanced and unlabeled nature of the dataset it operates on. Existing generative adversarial networks (GANs) are primarily used for creating synthetic samples from real ones. They have also proven successful in anomaly detection tasks. In our proposed method, we construct the trained encoder-discriminator as a one-class classifier based on a Bidirectional GAN (Bi-GAN) for detecting anomalous traffic from normal traffic, rather than calculating expensive and complex anomaly scores or thresholds. 
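The convex reformulation in arXiv:2202.01331 above can be sketched for its simplest case, the "gated ReLU" model with fixed random gate patterns, as a group-lasso problem. The version below hands the program to cvxpy's generic solver; the paper's contribution is precisely the fast specialized solvers (and the cone constraints that make the reformulation exactly match ReLU training), neither of which is reproduced here.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 60, 5, 20
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Random gate vectors; each one fixes an activation pattern over the data.
G = rng.normal(size=(d, m))
D = (X @ G > 0).astype(float)               # (n, m) gating masks

# Convex gated-ReLU training: squared loss plus a group-l1 penalty that
# sparsifies entire per-gate weight vectors (columns of W).
W = cp.Variable((d, m))
pred = cp.sum(cp.multiply(D, X @ W), axis=1)
lam = 0.1
cp.Problem(cp.Minimize(cp.sum_squares(pred - y)
                       + lam * cp.sum(cp.norm(W, 2, axis=0)))).solve()
print("active gates:", int((np.linalg.norm(W.value, axis=0) > 1e-6).sum()))
```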
Our experimental results illustrate that our proposed method is highly effective for network intrusion detection tasks and outperforms other similar generative methods on the NSL-KDD dataset.", "sentences": ["Training a Bidirectional GAN-based One-Class Classifier for Network Intrusion Detection.", "The network intrusion detection task is challenging because of the imbalanced and unlabeled nature of the dataset it operates on.", "Existing generative adversarial networks (GANs) are primarily used for creating synthetic samples from real ones.", "They have also proven successful in anomaly detection tasks.", "In our proposed method, we construct the trained encoder-discriminator as a one-class classifier based on a Bidirectional GAN (Bi-GAN) for detecting anomalous traffic from normal traffic, rather than calculating expensive and complex anomaly scores or thresholds.", "Our experimental results illustrate that our proposed method is highly effective for network intrusion detection tasks and outperforms other similar generative methods on the NSL-KDD dataset."]} {"id": "http://arxiv.org/abs/2202.01334", "title": "Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization.", "authors": "Dianbo Liu, Alex Lamb, Xu Ji, Pascal Notsawo, Mike Mozer, Yoshua Bengio, Kenji Kawaguchi", "abstract": "Vector Quantization (VQ) is a method for discretizing latent representations and has become a major part of the deep learning toolkit. It has been theoretically and empirically shown that discretization of representations leads to improved generalization, including in reinforcement learning where discretization can be used to bottleneck multi-agent communication to promote agent specialization and robustness. The discretization tightness of most VQ-based methods is defined by the number of discrete codes in the representation vector and the codebook size, which are fixed as hyperparameters. In this work, we propose learning to dynamically select discretization tightness conditioned on inputs, based on the hypothesis that data naturally contains variations in complexity that call for different levels of representational coarseness. 
We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks.", "sentences": ["Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization.", "Vector Quantization (VQ) is a method for discretizing latent representations and has become a major part of the deep learning toolkit.", "It has been theoretically and empirically shown that discretization of representations leads to improved generalization, including in reinforcement learning where discretization can be used to bottleneck multi-agent communication to promote agent specialization and robustness.", "The discretization tightness of most VQ-based methods is defined by the number of discrete codes in the representation vector and the codebook size, which are fixed as hyperparameters.", "In this work, we propose learning to dynamically select discretization tightness conditioned on inputs, based on the hypothesis that data naturally contains variations in complexity that call for different levels of representational coarseness.", "We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks."]} {"id": "http://arxiv.org/abs/2202.01336", "title": "Can Transformers be Strong Treatment Effect Estimators?.", "authors": "Yi-Fan Zhang, Hanlin Zhang, Zachary C. Lipton, Li Erran Li, Eric P. Xing", "abstract": "In this paper, we develop a general framework based on Transformer architectures to address a variety of challenging treatment effect estimation (TEE) problems. Our methods are applicable both when covariates are tabular and when they consist of sequences (e.g., in text), and can handle discrete, continuous, structured, or dosage-associated treatments. While Transformers have already emerged as dominant methods for diverse domains, including natural language and computer vision, our experiments with Transformers as Treatment Effect Estimators (TransTEE) demonstrate that these inductive biases are also effective on the sorts of estimation problems and datasets that arise in research aimed at estimating causal effects. Moreover, we propose a propensity score network that is trained with TransTEE in an adversarial manner to promote independence between covariates and treatments to further address selection bias. 
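The dynamic-VQ record above (arXiv:2202.01334) varies how tightly a representation is discretized. A PyTorch sketch of plain straight-through vector quantization where the number of usable codes k is an argument; in the paper k is selected per input by a learned mechanism rather than passed in by hand.

```python
import torch

def vector_quantize(z, codebook, k):
    """Snap each row of z to its nearest entry among the first k codes.
    Smaller k means a tighter (coarser) discrete bottleneck."""
    codes = codebook[:k]                           # (k, d)
    idx = torch.cdist(z, codes).argmin(dim=1)      # nearest-code indices
    zq = codes[idx]
    # Straight-through estimator: forward pass uses zq, gradients flow to z.
    return z + (zq - z).detach(), idx

torch.manual_seed(0)
codebook = torch.randn(64, 8)
z = torch.randn(10, 8)
zq_coarse, _ = vector_quantize(z, codebook, k=4)   # coarse bottleneck
zq_fine, _ = vector_quantize(z, codebook, k=64)    # fine bottleneck
# Always True here: the 4 codes are a subset of the 64, so the coarse
# bottleneck can only lose more information.
print((zq_coarse - z).norm() >= (zq_fine - z).norm())
```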
Through extensive experiments, we show that TransTEE significantly outperforms competitive baselines with greater parameter efficiency over a wide range of benchmarks and settings.", "sentences": ["Can Transformers be Strong Treatment Effect Estimators?.", "In this paper, we develop a general framework based on Transformer architectures to address a variety of challenging treatment effect estimation (TEE) problems.", "Our methods are applicable both when covariates are tabular and when they consist of sequences (e.g., in text), and can handle discrete, continuous, structured, or dosage-associated treatments.", "While Transformers have already emerged as dominant methods for diverse domains, including natural language and computer vision, our experiments with Transformers as Treatment Effect Estimators (TransTEE) demonstrate that these inductive biases are also effective on the sorts of estimation problems and datasets that arise in research aimed at estimating causal effects.", "Moreover, we propose a propensity score network that is trained with TransTEE in an adversarial manner to promote independence between covariates and treatments to further address selection bias.", "Through extensive experiments, we show that TransTEE significantly outperforms competitive baselines with greater parameter efficiency over a wide range of benchmarks and settings."]} {"id": "http://arxiv.org/abs/2202.01337", "title": "Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls.", "authors": "Farhad Maleki, Katie Ovens, Rajiv Gupta, Caroline Reinhold, Alan Spatz, Reza Forghani", "abstract": "Despite the great potential of machine learning, the lack of generalizability has hindered the widespread adoption of these technologies in routine clinical practice. We investigate three methodological pitfalls -- (1) violation of the independence assumption, (2) model evaluation with an inappropriate performance indicator, and (3) batch effect -- and how these pitfalls could affect the generalizability of machine learning models. We implement random forest and deep convolutional neural network models using several medical imaging datasets, including head and neck CT, lung CT, chest X-Ray, and histopathological images, to quantify and illustrate the effect of these pitfalls. We develop these models with and without the pitfall and compare the performance of the resulting models in terms of accuracy, precision, recall, and F1 score. Our results showed that violation of the independence assumption could substantially affect model generalizability. More specifically, (I) applying oversampling before splitting data into train, validation and test sets; (II) performing data augmentation before splitting data; (III) distributing data points for a subject across training, validation, and test sets; and (IV) applying feature selection before splitting data led to superficial boosts in model performance. We also observed that inappropriate performance indicators could lead to erroneous conclusions. Also, batch effects could lead to developing models that lack generalizability. The aforementioned methodological pitfalls lead to machine learning models with over-optimistic performance. These errors, if made, cannot be captured using internal model evaluation, and the inaccurate predictions made by the model may lead to wrong conclusions and interpretations. 
Therefore, avoiding these pitfalls is a necessary condition for developing generalizable models.", "sentences": ["Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls.", "Despite the great potential of machine learning, the lack of generalizability has hindered the widespread adoption of these technologies in routine clinical practice.", "We investigate three methodological pitfalls -- (1) violation of the independence assumption, (2) model evaluation with an inappropriate performance indicator, and (3) batch effect -- and how these pitfalls could affect the generalizability of machine learning models.", "We implement random forest and deep convolutional neural network models using several medical imaging datasets, including head and neck CT, lung CT, chest X-Ray, and histopathological images, to quantify and illustrate the effect of these pitfalls.", "We develop these models with and without the pitfall and compare the performance of the resulting models in terms of accuracy, precision, recall, and F1 score.", "Our results showed that violation of the independence assumption could substantially affect model generalizability.", "More specifically, (I) applying oversampling before splitting data into train, validation and test sets; (II) performing data augmentation before splitting data; (III) distributing data points for a subject across training, validation, and test sets; and (IV) applying feature selection before splitting data led to superficial boosts in model performance.", "We also observed that inappropriate performance indicators could lead to erroneous conclusions.", "Also, batch effects could lead to developing models that lack generalizability.", "The aforementioned methodological pitfalls lead to machine learning models with over-optimistic performance.", "These errors, if made, cannot be captured using internal model evaluation, and the inaccurate predictions made by the model may lead to wrong conclusions and interpretations.", "Therefore, avoiding these pitfalls is a necessary condition for developing generalizable models."]} {"id": "http://arxiv.org/abs/2202.01338", "title": "Regression Transformer: Concurrent Conditional Generation and Regression by Blending Numerical and Textual Tokens.", "authors": "Jannis Born, Matteo Manica", "abstract": "We report the Regression Transformer (RT), a method that abstracts regression as a conditional sequence modeling problem. The RT casts continuous properties as sequences of numerical tokens and encodes them jointly with conventional tokens. This yields a dichotomous model that can seamlessly transition between solving regression tasks and conditional generation tasks; solely governed by the mask location. We propose several extensions to the XLNet objective and adopt an alternating training scheme to concurrently optimize property prediction and conditional text generation based on a self-consistency loss. Our experiments on both chemical and protein languages demonstrate that the performance of traditional regression models can be surpassed despite training with cross entropy loss. Importantly, priming the same model with continuous properties yields a highly competitive conditional generative model that outperforms specialized approaches in a constrained property optimization benchmark. In sum, the Regression Transformer opens the door for \"swiss army knife\" models that excel at both regression and conditional generation. 
This finds application particularly in property-driven, local exploration of the chemical or protein space.", "sentences": ["Regression Transformer: Concurrent Conditional Generation and Regression by Blending Numerical and Textual Tokens.", "We report the Regression Transformer (RT), a method that abstracts regression as a conditional sequence modeling problem.", "The RT casts continuous properties as sequences of numerical tokens and encodes them jointly with conventional tokens.", "This yields a dichotomous model that can seamlessly transition between solving regression tasks and conditional generation tasks; solely governed by the mask location.", "We propose several extensions to the XLNet objective and adopt an alternating training scheme to concurrently optimize property prediction and conditional text generation based on a self-consistency loss.", "Our experiments on both chemical and protein languages demonstrate that the performance of traditional regression models can be surpassed despite training with cross entropy loss.", "Importantly, priming the same model with continuous properties yields a highly competitive conditional generative model that outperforms specialized approaches in a constrained property optimization benchmark.", "In sum, the Regression Transformer opens the door for \"swiss army knife\" models that excel at both regression and conditional generation.", "This finds application particularly in property-driven, local exploration of the chemical or protein space."]} {"id": "http://arxiv.org/abs/2202.01339", "title": "Understanding Cross-Domain Few-Shot Learning: An Experimental Study.", "authors": "Jaehoon Oh, Sungnyun Kim, Namgyu Ho, Jin-Hwa Kim, Hwanjun Song, Se-Young Yun", "abstract": "Cross-domain few-shot learning has drawn increasing attention for handling large differences between the source and target domains--an important concern in real-world scenarios. To overcome these large differences, recent works have considered exploiting small-scale unlabeled data from the target domain during the pre-training stage. This data enables self-supervised pre-training on the target domain, in addition to supervised pre-training on the source domain. In this paper, we empirically investigate scenarios under which it is advantageous to use each pre-training scheme, based on domain similarity and few-shot difficulty: performance gain of self-supervised pre-training over supervised pre-training increases when domain similarity is smaller or few-shot difficulty is lower. We further design two pre-training schemes, mixed-supervised and two-stage learning, that improve performance. In this light, we present seven findings for CD-FSL which are supported by extensive experiments and analyses on three source and eight target benchmark datasets with varying levels of domain similarity and few-shot difficulty. 
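The Regression Transformer record above (arXiv:2202.01338) hinges on casting a continuous property as a short sequence of numerical tokens. A sketch of one plausible digit-and-decimal-place encoding; the paper's actual token vocabulary may differ in its details.

```python
def numeric_tokens(value: float, precision: int = 2) -> list[str]:
    """Encode a float as position-aware digit tokens, e.g.
    12.45 -> ['_1_1_', '_2_0_', '_._', '_4_-1_', '_5_-2_'],
    where the second field is the decimal place of the digit.
    (Hypothetical scheme; negative numbers are not handled.)"""
    int_part, frac_part = f"{value:.{precision}f}".split(".")
    tokens = [f"_{d}_{len(int_part) - i - 1}_" for i, d in enumerate(int_part)]
    tokens.append("_._")
    tokens += [f"_{d}_{-(i + 1)}_" for i, d in enumerate(frac_part)]
    return tokens

print(numeric_tokens(12.45))
# ['_1_1_', '_2_0_', '_._', '_4_-1_', '_5_-2_']
```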
Our code is available at https://anonymous.4open.science/r/understandingCDFSL.", "sentences": ["Understanding Cross-Domain Few-Shot Learning: An Experimental Study.", "Cross-domain few-shot learning has drawn increasing attention for handling large differences between the source and target domains--an important concern in real-world scenarios.", "To overcome these large differences, recent works have considered exploiting small-scale unlabeled data from the target domain during the pre-training stage.", "This data enables self-supervised pre-training on the target domain, in addition to supervised pre-training on the source domain.", "In this paper, we empirically investigate scenarios under which it is advantageous to use each pre-training scheme, based on domain similarity and few-shot difficulty: performance gain of self-supervised pre-training over supervised pre-training increases when domain similarity is smaller or few-shot difficulty is lower.", "We further design two pre-training schemes, mixed-supervised and two-stage learning, that improve performance.", "In this light, we present seven findings for CD-FSL which are supported by extensive experiments and analyses on three source and eight target benchmark datasets with varying levels of domain similarity and few-shot difficulty.", "Our code is available at https://anonymous.4open.science/r/understandingCDFSL."]} {"id": "http://arxiv.org/abs/2202.01340", "title": "An Artificial Intelligence Dataset for Solar Energy Locations in India.", "authors": "Anthony Ortiz, Dhaval Negandhi, Sagar R Mysorekar, Joseph Kiesecker, Shivaprakash K Nagaraju, Caleb Robinson, Priyal Bhatia, Aditi Khurana, Jane Wang, Felipe Oviedo, Juan Lavista Ferres", "abstract": "Rapid development of renewable energy sources, particularly solar photovoltaics, is critical to mitigate climate change. As a result, India has set ambitious goals to install 300 gigawatts of solar energy capacity by 2030. Given the large footprint projected to meet these renewable energy targets, the potential for land use conflicts over environmental and social values is high. To expedite development of solar energy, land use planners will need access to up-to-date and accurate geo-spatial information of PV infrastructure. The majority of recent studies use either predictions of resource suitability or databases that are developed through crowdsourcing, which often have significant sampling biases or time lags between when projects are permitted and when location data becomes available. Here, we address this shortcoming by developing a spatially explicit machine learning model to map utility-scale solar projects across India. Using these outputs, we provide a cumulative measure of the solar footprint across India and quantify the degree of land modification associated with land cover types that may cause conflicts. Our analysis indicates that over 74\\% of solar development in India was built on land cover types that have natural ecosystem preservation and agricultural values. Thus, with a mean accuracy of 92\\%, this method permits the identification of the factors driving land suitability for solar projects and will be of widespread interest for studies seeking to assess trade-offs associated with the global decarbonization of green-energy systems. 
In the same way, our model increases the feasibility of remote sensing and long-term monitoring of renewable energy deployment targets.", "sentences": ["An Artificial Intelligence Dataset for Solar Energy Locations in India.", "Rapid development of renewable energy sources, particularly solar photovoltaics, is critical to mitigate climate change.", "As a result, India has set ambitious goals to install 300 gigawatts of solar energy capacity by 2030.", "Given the large footprint projected to meet these renewable energy targets, the potential for land use conflicts over environmental and social values is high.", "To expedite development of solar energy, land use planners will need access to up-to-date and accurate geo-spatial information of PV infrastructure.", "The majority of recent studies use either predictions of resource suitability or databases that are developed through crowdsourcing, which often have significant sampling biases or time lags between when projects are permitted and when location data becomes available.", "Here, we address this shortcoming by developing a spatially explicit machine learning model to map utility-scale solar projects across India.", "Using these outputs, we provide a cumulative measure of the solar footprint across India and quantify the degree of land modification associated with land cover types that may cause conflicts.", "Our analysis indicates that over 74\\% of solar development in India was built on land cover types that have natural ecosystem preservation and agricultural values.", "Thus, with a mean accuracy of 92\\%, this method permits the identification of the factors driving land suitability for solar projects and will be of widespread interest for studies seeking to assess trade-offs associated with the global decarbonization of green-energy systems.", "In the same way, our model increases the feasibility of remote sensing and long-term monitoring of renewable energy deployment targets."]} {"id": "http://arxiv.org/abs/2202.01341", "title": "Robust Binary Models by Pruning Randomly-initialized Networks.", "authors": "Chen Liu, Ziqi Zhao, Sabine S\u00fcsstrunk, Mathieu Salzmann", "abstract": "We propose ways to obtain robust models against adversarial attacks from randomly-initialized binary networks. Unlike adversarial training, which learns the model parameters, we in contrast learn the structure of the robust model by pruning a randomly-initialized binary network. Our method confirms the strong lottery ticket hypothesis in the presence of adversarial attacks. Compared to the results obtained in a non-adversarial setting, we in addition improve the performance and compression of the model by 1) using an adaptive pruning strategy for different layers, and 2) using a different initialization scheme such that all model parameters are initialized either to +1 or -1. 
Our extensive experiments demonstrate that our approach not only performs better than the state of the art for robust binary networks; it also achieves comparable or even better performance than full-precision network training methods.", "sentences": ["Robust Binary Models by Pruning Randomly-initialized Networks.", "We propose ways to obtain robust models against adversarial attacks from randomly-initialized binary networks.", "Unlike adversarial training, which learns the model parameters, we in contrast learn the structure of the robust model by pruning a randomly-initialized binary network.", "Our method confirms the strong lottery ticket hypothesis in the presence of adversarial attacks.", "Compared to the results obtained in a non-adversarial setting, we in addition improve the performance and compression of the model by 1) using an adaptive pruning strategy for different layers, and 2) using a different initialization scheme such that all model parameters are initialized either to +1 or -1.", "Our extensive experiments demonstrate that our approach not only performs better than the state of the art for robust binary networks; it also achieves comparable or even better performance than full-precision network training methods."]} {"id": "http://arxiv.org/abs/2202.01344", "title": "Formal Mathematics Statement Curriculum Learning.", "authors": "Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, Ilya Sutskever", "abstract": "We explore the use of expert iteration in the context of language modeling applied to formal mathematics. We show that at the same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only. We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs. Finally, by applying this expert iteration to a manually curated set of problem statements, we achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.", "sentences": ["Formal Mathematics Statement Curriculum Learning.", "We explore the use of expert iteration in the context of language modeling applied to formal mathematics.", "We show that at the same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only.", "We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs.", "Finally, by applying this expert iteration to a manually curated set of problem statements, we achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads."]} {"id": "http://arxiv.org/abs/2202.01361", "title": "Generative Flow Networks for Discrete Probabilistic Modeling.", "authors": "Dinghuai Zhang, Nikolay Malkin, Zhen Liu, Alexandra Volokhova, Aaron Courville, Yoshua Bengio", "abstract": "We present energy-based generative flow networks (EB-GFN), a novel probabilistic modeling algorithm for high-dimensional discrete data. 
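The robust-binary-models record above (arXiv:2202.01341) never trains weights at all; it prunes a frozen, randomly signed network. A sketch of the score-based masking step in the spirit of edge-popup: keep the top-scoring fraction of +/-1 weights. The scores here are random; in the actual method they are learned (adversarially, with per-layer adaptive sparsity).

```python
import torch

def binary_subnetwork(scores, weights, sparsity=0.5):
    """Keep the top (1 - sparsity) fraction of frozen +/-1 weights by score.
    Training would update only `scores` (straight-through), never `weights`."""
    k = int(scores.numel() * (1 - sparsity))
    threshold = scores.flatten().topk(k).values.min()
    return (scores >= threshold).float() * weights

torch.manual_seed(0)
w = torch.randint(0, 2, (256, 128)).float() * 2 - 1   # signed binary init
s = torch.rand(256, 128)                              # per-weight scores
effective = binary_subnetwork(s, w, sparsity=0.7)
print((effective != 0).float().mean())                # ~0.3 of weights kept
```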
Building upon the theory of generative flow networks (GFlowNets), we model the generation process by a stochastic data construction policy and thus amortize expensive MCMC exploration into a fixed number of actions sampled from a GFlowNet. We show how GFlowNets can approximately perform large-block Gibbs sampling to mix between modes. We propose a framework to jointly train a GFlowNet with an energy function, so that the GFlowNet learns to sample from the energy distribution, while the energy learns with an approximate MLE objective with negative samples from the GFlowNet. We demonstrate EB-GFN's effectiveness on various probabilistic modeling tasks.", "sentences": ["Generative Flow Networks for Discrete Probabilistic Modeling.", "We present energy-based generative flow networks (EB-GFN), a novel probabilistic modeling algorithm for high-dimensional discrete data.", "Building upon the theory of generative flow networks (GFlowNets), we model the generation process by a stochastic data construction policy and thus amortize expensive MCMC exploration into a fixed number of actions sampled from a GFlowNet.", "We show how GFlowNets can approximately perform large-block Gibbs sampling to mix between modes.", "We propose a framework to jointly train a GFlowNet with an energy function, so that the GFlowNet learns to sample from the energy distribution, while the energy learns with an approximate MLE objective with negative samples from the GFlowNet.", "We demonstrate EB-GFN's effectiveness on various probabilistic modeling tasks."]} {"id": "http://arxiv.org/abs/2202.01374", "title": "mSLAM: Massively multilingual joint pre-training for speech and text.", "authors": "Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, Alexis Conneau", "abstract": "We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space. We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID while being competitive on multilingual ASR, when compared against speech-only pre-training. Our speech translation model demonstrates zero-shot text translation without seeing any text translation data, providing evidence for cross-modal alignment of representations. mSLAM also benefits from multi-modal fine-tuning, further improving the quality of speech translation by directly leveraging text translation data during the fine-tuning process. 
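The EB-GFN record above (arXiv:2202.01361) trains an energy function against negatives drawn from a GFlowNet. The sketch below isolates just the approximate-MLE energy update on a trivial independent-bit model, stubbing the GFlowNet out with a uniform sampler; with that stub the parameters overshoot, which is exactly why the sampler must be trained jointly, as the paper does.

```python
import torch

d = 16
theta = torch.zeros(d, requires_grad=True)     # energy E(x) = -<theta, x>

def energy(x):
    return -(x @ theta)

def sampler_stub(n):                           # stand-in for the GFlowNet
    return (torch.rand(n, d) < 0.5).float()

opt = torch.optim.Adam([theta], lr=0.05)
data = (torch.rand(256, d) < 0.8).float()      # synthetic binary "dataset"
for _ in range(200):
    pos, neg = data, sampler_stub(256)
    # Approximate MLE: lower the energy of data, raise it on sampler output.
    loss = energy(pos).mean() - energy(neg).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# With a perfect sampler, sigmoid(theta) would settle near the data marginal
# 0.8; the fixed uniform stub lets theta drift past it instead.
print(theta.sigmoid().mean())
```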
Our empirical analysis highlights several opportunities and challenges arising from large-scale multimodal pre-training, suggesting directions for future research.", "sentences": ["mSLAM: Massively multilingual joint pre-training for speech and text.", "We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages.", "mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space.", "We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID while being competitive on multilingual ASR, when compared against speech-only pre-training.", "Our speech translation model demonstrates zero-shot text translation without seeing any text translation data, providing evidence for cross-modal alignment of representations.", "mSLAM also benefits from multi-modal fine-tuning, further improving the quality of speech translation by directly leveraging text translation data during the fine-tuning process.", "Our empirical analysis highlights several opportunities and challenges arising from large-scale multimodal pre-training, suggesting directions for future research."]} {"id": "http://arxiv.org/abs/2202.01375", "title": "Resource Management and Security Scheme of ICPSs and IoT Based on VNE Algorithm.", "authors": "Peiying Zhang, Chao Wang, Chunxiao Jiang, Neeraj Kumar, Qinghua Lu", "abstract": "The development of Intelligent Cyber-Physical Systems (ICPSs) in virtual network environments faces severe challenges. On the one hand, the Internet of Things (IoT) built on ICPSs needs the support of a large amount of appropriately allocated network resources. On the other hand, ICPSs face severe network security problems. The integration of ICPSs and network virtualization (NV) can provide more efficient network resource support and security guarantees for IoT users. Based on the above two problems faced by ICPSs, we propose a virtual network embedding (VNE) algorithm with computing, storage, and security constraints to ensure the rationality and security of resource allocation in ICPSs. In particular, we use a reinforcement learning (RL) method as a means to improve algorithm performance. We extract the important attributes of the underlying network as the training environment of the RL agent. The agent can derive the optimal node embedding strategy through training, so as to meet the requirements of ICPSs for resource management and security. The embedding of virtual links is based on a breadth-first search (BFS) strategy. Therefore, this is a comprehensive two-stage RL-VNE algorithm considering the three-dimensional constraints of computing, storage, and security resources. Finally, we design a large number of simulation experiments from the perspective of typical indicators of VNE algorithms. 
The experimental results illustrate the effectiveness of the algorithm in ICPS applications.", "sentences": ["Resource Management and Security Scheme of ICPSs and IoT Based on VNE Algorithm.", "The development of Intelligent Cyber-Physical Systems (ICPSs) in virtual network environments faces severe challenges.", "On the one hand, the Internet of Things (IoT) built on ICPSs needs the support of a large amount of appropriately allocated network resources.", "On the other hand, ICPSs face severe network security problems.", "The integration of ICPSs and network virtualization (NV) can provide more efficient network resource support and security guarantees for IoT users.", "Based on the above two problems faced by ICPSs, we propose a virtual network embedding (VNE) algorithm with computing, storage, and security constraints to ensure the rationality and security of resource allocation in ICPSs.", "In particular, we use a reinforcement learning (RL) method as a means to improve algorithm performance.", "We extract the important attributes of the underlying network as the training environment of the RL agent.", "The agent can derive the optimal node embedding strategy through training, so as to meet the requirements of ICPSs for resource management and security.", "The embedding of virtual links is based on a breadth-first search (BFS) strategy.", "Therefore, this is a comprehensive two-stage RL-VNE algorithm considering the three-dimensional constraints of computing, storage, and security resources.", "Finally, we design a large number of simulation experiments from the perspective of typical indicators of VNE algorithms.", "The experimental results illustrate the effectiveness of the algorithm in ICPS applications."]} {"id": "http://arxiv.org/abs/2202.01380", "title": "Learning Mechanically Driven Emergent Behavior with Message Passing Neural Networks.", "authors": "Peerasait Prachaseree, Emma Lejeune", "abstract": "From designing architected materials to connecting mechanical behavior across scales, computational modeling is a critical tool in solid mechanics. Recently, there has been a growing interest in using machine learning to reduce the computational cost of physics-based simulations. Notably, while machine learning approaches that rely on Graph Neural Networks (GNNs) have shown success in learning mechanics, the performance of GNNs has yet to be investigated on a myriad of solid mechanics problems. In this work, we examine the ability of GNNs to predict a fundamental aspect of mechanically driven emergent behavior: the connection between a column's geometric structure and the direction that it buckles. To accomplish this, we introduce the Asymmetric Buckling Columns (ABC) dataset, a dataset comprised of three sub-datasets of asymmetric and heterogeneous column geometries where the goal is to classify the direction of symmetry breaking (left or right) under compression after the onset of instability. Because of complex local geometry, the \"image-like\" data representations required for implementing standard convolutional neural network based metamodels are not ideal, thus motivating the use of GNNs. In addition to investigating GNN model architecture, we study the effect of different input data representation approaches, data augmentation, and combining multiple models as an ensemble. While we were able to obtain good results, we also showed that predicting solid mechanics based emergent behavior is non-trivial. 
Because both our model implementation and dataset are distributed under open-source licenses, we hope that future researchers can build on our work to create enhanced mechanics-specific machine learning pipelines for capturing the behavior of complex geometric structures.", "sentences": ["Learning Mechanically Driven Emergent Behavior with Message Passing Neural Networks.", "From designing architected materials to connecting mechanical behavior across scales, computational modeling is a critical tool in solid mechanics.", "Recently, there has been a growing interest in using machine learning to reduce the computational cost of physics-based simulations.", "Notably, while machine learning approaches that rely on Graph Neural Networks (GNNs) have shown success in learning mechanics, the performance of GNNs has yet to be investigated on a myriad of solid mechanics problems.", "In this work, we examine the ability of GNNs to predict a fundamental aspect of mechanically driven emergent behavior: the connection between a column's geometric structure and the direction that it buckles.", "To accomplish this, we introduce the Asymmetric Buckling Columns (ABC) dataset, a dataset comprised of three sub-datasets of asymmetric and heterogeneous column geometries where the goal is to classify the direction of symmetry breaking (left or right) under compression after the onset of instability.", "Because of complex local geometry, the \"image-like\" data representations required for implementing standard convolutional neural network based metamodels are not ideal, thus motivating the use of GNNs.", "In addition to investigating GNN model architecture, we study the effect of different input data representation approaches, data augmentation, and combining multiple models as an ensemble.", "While we were able to obtain good results, we also showed that predicting solid mechanics based emergent behavior is non-trivial.", "Because both our model implementation and dataset are distributed under open-source licenses, we hope that future researchers can build on our work to create enhanced mechanics-specific machine learning pipelines for capturing the behavior of complex geometric structures."]} {"id": "http://arxiv.org/abs/2202.01381", "title": "ETSformer: Exponential Smoothing Transformers for Time-series Forecasting.", "authors": "Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, Steven Hoi", "abstract": "Transformers have been actively studied for time-series forecasting in recent years. While often showing promising results in various scenarios, traditional Transformers are not designed to fully exploit the characteristics of time-series data and thus suffer some fundamental limitations, e.g., they generally lack decomposition capability and interpretability, and are neither effective nor efficient for long-term forecasting. In this paper, we propose ETSformer, a novel time-series Transformer architecture, which exploits the principle of exponential smoothing in improving Transformers for time-series forecasting. In particular, inspired by the classical exponential smoothing methods in time-series forecasting, we propose the novel exponential smoothing attention (ESA) and frequency attention (FA) to replace the self-attention mechanism in vanilla Transformers, thus improving both accuracy and efficiency. 
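The ABC-dataset record above (arXiv:2202.01380) feeds discretized column geometries to message passing networks. A bare-bones, parameter-free sketch of one message-passing readout over a chain of nodes, just to make the data flow concrete; an untrained readout like this carries no real signal, and the printed label is arbitrary.

```python
import numpy as np

def message_passing(node_feats, edges, steps=3):
    """Mean-aggregation message passing over an undirected graph.
    Real GNN layers would apply learned transforms at each step."""
    n = node_feats.shape[0]
    adj = np.zeros((n, n))
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    h = node_feats
    for _ in range(steps):
        h = np.tanh(h + adj @ h / deg)     # aggregate neighbours, update
    return h.mean(axis=0)                  # graph-level readout

# Toy column discretized into 5 nodes; the single feature is a signed
# lateral offset whose asymmetry relates to the buckling direction.
feats = np.array([[0.0], [0.1], [0.2], [-0.1], [0.0]])
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
readout = message_passing(feats, edges)
print("left" if readout[0] < 0 else "right")
```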
Based on these, we redesign the Transformer architecture with modular decomposition blocks such that it can learn to decompose the time-series data into interpretable time-series components such as level, growth and seasonality. Extensive experiments on various time-series benchmarks validate the efficacy and advantages of the proposed method. The code and models of our implementations will be released.", "sentences": ["ETSformer: Exponential Smoothing Transformers for Time-series Forecasting.", "Transformers have been actively studied for time-series forecasting in recent years.", "While often showing promising results in various scenarios, traditional Transformers are not designed to fully exploit the characteristics of time-series data and thus suffer some fundamental limitations, e.g., they generally lack decomposition capability and interpretability, and are neither effective nor efficient for long-term forecasting.", "In this paper, we propose ETSformer, a novel time-series Transformer architecture, which exploits the principle of exponential smoothing in improving Transformers for time-series forecasting.", "In particular, inspired by the classical exponential smoothing methods in time-series forecasting, we propose the novel exponential smoothing attention (ESA) and frequency attention (FA) to replace the self-attention mechanism in vanilla Transformers, thus improving both accuracy and efficiency.", "Based on these, we redesign the Transformer architecture with modular decomposition blocks such that it can learn to decompose the time-series data into interpretable time-series components such as level, growth and seasonality.", "Extensive experiments on various time-series benchmarks validate the efficacy and advantages of the proposed method.", "The code and models of our implementations will be released."]} {"id": "http://arxiv.org/abs/2202.01383", "title": "Machine Learning Solar Wind Driving Magnetospheric Convection in Tail Lobes.", "authors": "Xin Cao, Jasper S. Halekas, Stein Haaland, Suranga Ruhunusiri, Karl-Heinz Glassmeier", "abstract": "To quantitatively study the driving mechanisms of magnetospheric convection in the magnetotail lobes on a global scale, we utilize data from the ARTEMIS spacecraft in the deep tail and the Cluster spacecraft in the near tail. Previous work demonstrated that, in the lobes near the Moon, we can estimate the convection by utilizing ARTEMIS measurements of lunar ion velocities. In this paper, we analyze these datasets with machine learning models to determine what upstream factors drive the lobe convection in different magnetotail regions and thereby understand the mechanisms that control the dynamics of the tail lobes. Our results show that the correlations between the predicted and test convection velocities for the machine learning models (> 0.75) are much better than those of the multiple linear regression model (~ 0.23 - 0.43).
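To make the exponential smoothing attention idea concrete, the sketch below replaces query-key attention with fixed causal weights alpha * (1 - alpha)**lag, the classical exponential-smoothing kernel; the single parameter alpha and the toy series are assumptions, not the released ETSformer code:

```python
import numpy as np

def exponential_smoothing_attention(V, alpha=0.3):
    """Causal attention whose weights follow exponential smoothing.

    Instead of query-key dot products, token t attends to past tokens
    with fixed weights alpha * (1 - alpha)**(t - s), so recent tokens
    dominate; V has shape (T, d).
    """
    T = V.shape[0]
    lags = np.arange(T)[:, None] - np.arange(T)[None, :]   # t - s
    W = np.where(lags >= 0, alpha * (1 - alpha) ** lags, 0.0)
    W /= W.sum(1, keepdims=True)   # renormalise each row over its causal prefix
    return W @ V

series = np.sin(np.linspace(0, 6, 50))[:, None]
smoothed = exponential_smoothing_attention(series)
print(smoothed[-1])   # exponentially smoothed summary of the history
```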
The systematic analysis reveals that the IMF and magnetospheric activity play an important role in influencing plasma convection in the global magnetotail lobes.", "sentences": ["Machine Learning Solar Wind Driving Magnetospheric Convection in Tail Lobes.", "To quantitatively study the driving mechanisms of magnetospheric convection in the magnetotail lobes on a global scale, we utilize data from the ARTEMIS spacecraft in the deep tail and the Cluster spacecraft in the near tail.", "Previous work demonstrated that, in the lobes near the Moon, we can estimate the convection by utilizing ARTEMIS measurements of lunar ion velocities.", "In this paper, we analyze these datasets with machine learning models to determine what upstream factors drive the lobe convection in different magnetotail regions and thereby understand the mechanisms that control the dynamics of the tail lobes.", "Our results show that the correlations between the predicted and test convection velocities for the machine learning models (> 0.75) are much better than those of the multiple linear regression model (~ 0.23 - 0.43).", "The systematic analysis reveals that the IMF and magnetospheric activity play an important role in influencing plasma convection in the global magnetotail lobes."]} {"id": "http://arxiv.org/abs/2202.01388", "title": "Self-consistent Gradient-like Eigen Decomposition in Solving Schr\\"odinger Equations.", "authors": "Xihan Li, Xiang Chen, Rasul Tutunov, Haitham Bou-Ammar, Lei Wang, Jun Wang", "abstract": "The Schr\\"odinger equation is at the heart of modern quantum mechanics. Since exact solutions of the ground state are typically intractable, standard approaches approximate the Schr\\"odinger equation as forms of nonlinear generalized eigenvalue problems $F(V)V = SV\\Lambda$ in which $F(V)$, the matrix to be decomposed, is a function of its own top-$k$ smallest eigenvectors $V$, leading to a \"self-consistency problem\". Traditional iterative methods heavily rely on high-quality initial guesses of $V$ generated via domain-specific heuristic methods based on quantum mechanics. In this work, we eliminate such a need for domain-specific heuristics by presenting a novel framework, Self-consistent Gradient-like Eigen Decomposition (SCGLED) that regards $F(V)$ as a special \"online data generator\", thus allowing gradient-like eigendecomposition methods in streaming $k$-PCA to approach the self-consistency of the equation from scratch in an iterative way similar to online learning. With several critical numerical improvements, SCGLED is robust to initial guesses, free of quantum-mechanism-based heuristic designs, and neat in implementation.
Our experiments show that it not only can simply replace traditional heuristics-based initial guess methods with a large performance advantage (on average achieving 25x higher precision than the best baseline in similar wall time), but also is capable of finding highly precise solutions independently without any traditional iterative methods.", "sentences": ["Self-consistent Gradient-like Eigen Decomposition in Solving Schr\\"odinger Equations.", "The Schr\\"odinger equation is at the heart of modern quantum mechanics.", "Since exact solutions of the ground state are typically intractable, standard approaches approximate the Schr\\"odinger equation as forms of nonlinear generalized eigenvalue problems $F(V)V = SV\\Lambda$ in which $F(V)$, the matrix to be decomposed, is a function of its own top-$k$ smallest eigenvectors $V$, leading to a \"self-consistency problem\".", "Traditional iterative methods heavily rely on high-quality initial guesses of $V$ generated via domain-specific heuristic methods based on quantum mechanics.", "In this work, we eliminate such a need for domain-specific heuristics by presenting a novel framework, Self-consistent Gradient-like Eigen Decomposition (SCGLED) that regards $F(V)$ as a special \"online data generator\", thus allowing gradient-like eigendecomposition methods in streaming $k$-PCA to approach the self-consistency of the equation from scratch in an iterative way similar to online learning.", "With several critical numerical improvements, SCGLED is robust to initial guesses, free of quantum-mechanism-based heuristic designs, and neat in implementation.", "Our experiments show that it not only can simply replace traditional heuristics-based initial guess methods with a large performance advantage (on average achieving 25x higher precision than the best baseline in similar wall time), but also is capable of finding highly precise solutions independently without any traditional iterative methods."]} {"id": "http://arxiv.org/abs/2202.01389", "title": "An Empirical Review of Optimization Techniques for Quantum Variational Circuits.", "authors": "Owen Lockwood", "abstract": "Quantum Variational Circuits (QVCs) are often claimed as one of the most potent uses of both near-term and long-term quantum hardware. The standard approaches to optimizing these circuits rely on a classical system to compute the new parameters at every optimization step. However, this process can be extremely challenging both in terms of navigating the exponentially scaling complex Hilbert space, barren plateaus, and the noise present in all foreseeable quantum hardware. Although a variety of optimization algorithms are employed in practice, there is often a lack of theoretical or empirical motivations for this choice. To this end, we empirically evaluate the potential of many common gradient and gradient-free optimizers on a variety of optimization tasks. These tasks include both classical and quantum data based optimization routines. Our evaluations were conducted in both noise-free and noisy simulations.
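A heavily simplified sketch of the "online data generator" view: subspace iteration on F(V), re-orthonormalized as in streaming k-PCA, drifting toward the self-consistent smallest eigenpairs. The toy operator F, the step size, and the iteration count are assumptions; SCGLED's actual numerical refinements are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2

def F(V):
    """Toy self-consistent operator: a fixed matrix plus a small
    coupling that depends on the current eigenvector estimate V."""
    A = np.diag(np.arange(1.0, n + 1.0))
    return A + 0.1 * V @ V.T

V = np.linalg.qr(rng.normal(size=(n, k)))[0]   # random orthonormal start
eta = 0.05
for _ in range(500):
    G = F(V) @ V                 # "online" data generated by the current iterate
    V = V - eta * G              # gradient-like step toward the SMALLEST eigenpairs
    V, _ = np.linalg.qr(V)       # re-orthonormalise, streaming k-PCA style

lam = np.diag(V.T @ F(V) @ V)
print("self-consistent smallest eigenvalues:", np.sort(lam))
```

Because the step multiplies each eigencomponent by (1 - eta * lambda), the components with the smallest eigenvalues shrink slowest, so the iteration settles on the bottom-k eigenspace of the (slowly changing) operator without any domain-specific initial guess.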
The large number of problems and optimizers yields the strong empirical guidance for choosing optimizers for QVCs that is currently lacking.", "sentences": ["An Empirical Review of Optimization Techniques for Quantum Variational Circuits.", "Quantum Variational Circuits (QVCs) are often claimed as one of the most potent uses of both near-term and long-term quantum hardware.", "The standard approaches to optimizing these circuits rely on a classical system to compute the new parameters at every optimization step.", "However, this process can be extremely challenging both in terms of navigating the exponentially scaling complex Hilbert space, barren plateaus, and the noise present in all foreseeable quantum hardware.", "Although a variety of optimization algorithms are employed in practice, there is often a lack of theoretical or empirical motivations for this choice.", "To this end, we empirically evaluate the potential of many common gradient and gradient-free optimizers on a variety of optimization tasks.", "These tasks include both classical and quantum data based optimization routines.", "Our evaluations were conducted in both noise-free and noisy simulations.", "The large number of problems and optimizers yields the strong empirical guidance for choosing optimizers for QVCs that is currently lacking."]} {"id": "http://arxiv.org/abs/2202.01390", "title": "Exploring Sub-skeleton Trajectories for Interpretable Recognition of Sign Language.", "authors": "Joachim Gudmundsson, Martin P. Seybold, John Pfeifer", "abstract": "Recent advances in tracking sensors and pose estimation software enable smart systems to use trajectories of skeleton joint locations for supervised learning. We study the problem of accurately recognizing sign language words, which is key to narrowing the communication gap between hard-of-hearing and hearing people. Our method explores a geometric feature space that we call `sub-skeleton' aspects of movement. We assess similarity of feature space trajectories using natural, speed invariant distance measures, which enables clear and insightful nearest neighbor classification. The simplicity and scalability of our basic method allow for immediate application in different data domains with little to no parameter tuning. We demonstrate the effectiveness of our basic method, and a boosted variation, with experiments on data from different application domains and tracking technologies.
Surprisingly, our simple methods improve sign recognition over recent, state-of-the-art approaches.", "sentences": ["Exploring Sub-skeleton Trajectories for Interpretable Recognition of Sign Language.", "Recent advances in tracking sensors and pose estimation software enable smart systems to use trajectories of skeleton joint locations for supervised learning.", "We study the problem of accurately recognizing sign language words, which is key to narrowing the communication gap between hard-of-hearing and hearing people.", "Our method explores a geometric feature space that we call `sub-skeleton' aspects of movement.", "We assess similarity of feature space trajectories using natural, speed invariant distance measures, which enables clear and insightful nearest neighbor classification.", "The simplicity and scalability of our basic method allow for immediate application in different data domains with little to no parameter tuning.", "We demonstrate the effectiveness of our basic method, and a boosted variation, with experiments on data from different application domains and tracking technologies.", "Surprisingly, our simple methods improve sign recognition over recent, state-of-the-art approaches."]} {"id": "http://arxiv.org/abs/2202.01391", "title": "Fair Representation Clustering with Several Protected Classes.", "authors": "Zhen Dai, Yury Makarychev, Ali Vakilian", "abstract": "We study the problem of fair $k$-median where each cluster is required to have a fair representation of individuals from different groups. In the fair representation $k$-median problem, we are given a set of points $X$ in a metric space. Each point $x\\in X$ belongs to one of $\\ell$ groups. Further, we are given fair representation parameters $\\alpha_j$ and $\\beta_j$ for each group $j\\in [\\ell]$. We say that a $k$-clustering $C_1, \\cdots, C_k$ fairly represents all groups if the number of points from group $j$ in cluster $C_i$ is between $\\alpha_j |C_i|$ and $\\beta_j |C_i|$ for every $j\\in[\\ell]$ and $i\\in [k]$. The goal is to find a set $\\mathcal{C}$ of $k$ centers and an assignment $\\phi: X\\rightarrow \\mathcal{C}$ such that the clustering defined by $(\\mathcal{C}, \\phi)$ fairly represents all groups and minimizes the $\\ell_1$-objective $\\sum_{x\\in X} d(x, \\phi(x))$. We present an $O(\\log k)$-approximation algorithm that runs in time $n^{O(\\ell)}$. Note that the known algorithms for the problem either (i) violate the fairness constraints by an additive term or (ii) run in time that is exponential in both $k$ and $\\ell$. We also consider an important special case of the problem where $\\alpha_j = \\beta_j = \\frac{f_j}{f}$ and $f_j, f \\in \\mathbb{N}$ for all $j\\in [\\ell]$.
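As an illustration of speed-invariant nearest-neighbor classification over trajectories, the sketch below uses the discrete Frechet distance, which compares trajectory shapes regardless of traversal speed; this particular distance and the toy trajectories are assumptions, not necessarily the paper's exact measures:

```python
from functools import lru_cache
import math

def frechet(P, Q):
    """Discrete Frechet distance between two 2-D point sequences.

    It compares the shapes of trajectories while being insensitive to
    the speed at which they are traversed, so the same sign performed
    at different paces can still match."""
    @lru_cache(maxsize=None)
    def c(i, j):
        d = math.dist(P[i], Q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
    return c(len(P) - 1, len(Q) - 1)

def nearest_neighbor_sign(query, train):
    """train: list of (trajectory, label); returns the closest label."""
    return min(train, key=lambda tl: frechet(query, tl[0]))[1]

slow = [(0, 0), (0, 1), (1, 1)]
fast = [(0, 0), (1, 1)]                       # same shape, fewer samples
other = [(0, 0), (1, 0), (1, -1)]
print(nearest_neighbor_sign(fast, [(slow, "hello"), (other, "bye")]))
```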
For this special case, we present an $O(\\log k)$-approximation algorithm that runs in $(kf)^{O(\\ell)}\\log n + poly(n)$ time.", "sentences": ["Fair Representation Clustering with Several Protected Classes.", "We study the problem of fair $k$-median where each cluster is required to have a fair representation of individuals from different groups.", "In the fair representation $k$-median problem, we are given a set of points $X$ in a metric space.", "Each point $x\\in X$ belongs to one of $\\ell$ groups.", "Further, we are given fair representation parameters $\\alpha_j$ and $\\beta_j$ for each group $j\\in [\\ell]$.", "We say that a $k$-clustering $C_1, \\cdots, C_k$ fairly represents all groups if the number of points from group $j$ in cluster $C_i$ is between $\\alpha_j |C_i|$ and $\\beta_j |C_i|$ for every $j\\in[\\ell]$ and $i\\in [k]$.", "The goal is to find a set $\\mathcal{C}$ of $k$ centers and an assignment $\\phi: X\\rightarrow \\mathcal{C}$ such that the clustering defined by $(\\mathcal{C}, \\phi)$ fairly represents all groups and minimizes the $\\ell_1$-objective $\\sum_{x\\in X} d(x, \\phi(x))$.", "We present an $O(\\log k)$-approximation algorithm that runs in time $n^{O(\\ell)}$.", "Note that the known algorithms for the problem either (i) violate the fairness constraints by an additive term or (ii) run in time that is exponential in both $k$ and $\\ell$.", "We also consider an important special case of the problem where $\\alpha_j = \\beta_j = \\frac{f_j}{f}$ and $f_j, f \\in \\mathbb{N}$ for all $j\\in [\\ell]$.", "For this special case, we present an $O(\\log k)$-approximation algorithm that runs in $(kf)^{O(\\ell)}\\log n + poly(n)$ time."]} {"id": "http://arxiv.org/abs/2202.01397", "title": "Learning with Asymmetric Kernels: Least Squares and Feature Interpretation.", "authors": "Mingzhen He, Fan He, Lei Shi, Xiaolin Huang, Johan A.K. Suykens", "abstract": "Asymmetric kernels naturally exist in real life, e.g., for conditional probability and directed graphs. However, most of the existing kernel-based learning methods require kernels to be symmetric, which prevents the use of asymmetric kernels. This paper addresses asymmetric kernel-based learning in the framework of the least squares support vector machine, named AsK-LS, resulting in the first classification method that can utilize asymmetric kernels directly. We will show that AsK-LS can learn with asymmetric features, namely source and target features, while the kernel trick remains applicable, i.e., the source and target features exist but are not necessarily known. Besides, the computational cost of AsK-LS is as low as that of dealing with symmetric kernels.
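The fair-representation condition itself is easy to state in code; the checker below verifies it exactly as defined in the record above (group counts within [alpha_j |C_i|, beta_j |C_i|] for every cluster and group), with function and variable names chosen for illustration:

```python
def fairly_represents(clusters, group_of, alpha, beta):
    """Check the fair-representation condition: for every cluster C_i and
    every group j, the number of group-j points in C_i must lie in
    [alpha[j] * |C_i|, beta[j] * |C_i|].

    clusters: list of lists of point ids; group_of: dict id -> group."""
    groups = set(group_of.values())
    for C in clusters:
        size = len(C)
        for j in groups:
            n_j = sum(1 for x in C if group_of[x] == j)
            if not (alpha[j] * size <= n_j <= beta[j] * size):
                return False
    return True

# Two groups with exact proportions 1/2 each (the alpha_j = beta_j = f_j/f case).
group_of = {0: "a", 1: "b", 2: "a", 3: "b"}
clusters = [[0, 1], [2, 3]]
print(fairly_represents(clusters, group_of,
                        {"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}))  # True
```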
Experimental results on the Corel database, directed graphs, and the UCI database show that, when asymmetric information is crucial, the proposed AsK-LS can learn with asymmetric kernels and performs much better than the existing kernel methods that have to do symmetrization to accommodate asymmetric kernels.", "sentences": ["Learning with Asymmetric Kernels: Least Squares and Feature Interpretation.", "Asymmetric kernels naturally exist in real life, e.g., for conditional probability and directed graphs.", "However, most of the existing kernel-based learning methods require kernels to be symmetric, which prevents the use of asymmetric kernels.", "This paper addresses asymmetric kernel-based learning in the framework of the least squares support vector machine, named AsK-LS, resulting in the first classification method that can utilize asymmetric kernels directly.", "We will show that AsK-LS can learn with asymmetric features, namely source and target features, while the kernel trick remains applicable, i.e., the source and target features exist but are not necessarily known.", "Besides, the computational cost of AsK-LS is as low as that of dealing with symmetric kernels.", "Experimental results on the Corel database, directed graphs, and the UCI database show that, when asymmetric information is crucial, the proposed AsK-LS can learn with asymmetric kernels and performs much better than the existing kernel methods that have to do symmetrization to accommodate asymmetric kernels."]} {"id": "http://arxiv.org/abs/2202.01402", "title": "GALAXY: Graph-based Active Learning at the Extreme.", "authors": "Jifan Zhang, Julian Katz-Samuels, Robert Nowak", "abstract": "Active learning is a label-efficient approach to train highly effective models while interactively selecting only small subsets of unlabelled data for labelling and training. In \"open world\" settings, the classes of interest can make up a small fraction of the overall dataset -- most of the data may be viewed as an out-of-distribution or irrelevant class. This leads to extreme class-imbalance, and our theory and methods focus on this core issue. We propose a new strategy for active learning called GALAXY (Graph-based Active Learning At the eXtrEme), which blends ideas from graph-based active learning and deep learning. GALAXY automatically and adaptively selects more class-balanced examples for labeling than most other methods for active learning. Our theory shows that GALAXY performs a refined form of uncertainty sampling that gathers a much more class-balanced dataset than vanilla uncertainty sampling.
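A simplified sketch of LS-SVM training with an asymmetric kernel: the usual least-squares SVM dual linear system is solved with an asymmetric Gram matrix plugged in. This is an assumption-laden illustration, not the paper's exact source/target two-feature formulation:

```python
import numpy as np

rng = np.random.default_rng(1)

def ask_ls_fit(X, y, kernel, gamma=1.0):
    """Solve a least-squares SVM dual system with an (asymmetric) kernel.

    K[i, j] = kernel(x_i, x_j) need not equal K[j, i]; we simply solve
    the standard LS-SVM system [[0, 1^T], [1, K + I/gamma]] [b; a] = [0; y].
    """
    n = len(y)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    A = np.block([[np.zeros((1, 1)), np.ones((1, n))],
                  [np.ones((n, 1)), K + np.eye(n) / gamma]])
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    b, alpha = sol[0], sol[1:]
    return lambda z: np.sign(sum(a * kernel(x, z) for a, x in zip(alpha, X)) + b)

# A "directed" kernel: k(x, z) != k(z, x) in general.
kernel = lambda x, z: np.exp(-np.sum((x - z) ** 2)) * (1.0 + 0.5 * np.tanh(x[0] - z[0]))
X = rng.normal(size=(20, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=20))
predict = ask_ls_fit(X, y, kernel)
print(predict(np.array([1.0, 0.0])), predict(np.array([-1.0, 0.0])))
```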
Experimentally, we demonstrate GALAXY's superiority over existing state-of-the-art deep active learning algorithms in unbalanced vision classification settings generated from popular datasets.", "sentences": ["GALAXY: Graph-based Active Learning at the Extreme.", "Active learning is a label-efficient approach to train highly effective models while interactively selecting only small subsets of unlabelled data for labelling and training.", "In \"open world\" settings, the classes of interest can make up a small fraction of the overall dataset -- most of the data may be viewed as an out-of-distribution or irrelevant class.", "This leads to extreme class-imbalance, and our theory and methods focus on this core issue.", "We propose a new strategy for active learning called GALAXY (Graph-based Active Learning At the eXtrEme), which blends ideas from graph-based active learning and deep learning.", "GALAXY automatically and adaptively selects more class-balanced examples for labeling than most other methods for active learning.", "Our theory shows that GALAXY performs a refined form of uncertainty sampling that gathers a much more class-balanced dataset than vanilla uncertainty sampling.", "Experimentally, we demonstrate GALAXY's superiority over existing state-of-the-art deep active learning algorithms in unbalanced vision classification settings generated from popular datasets."]} {"id": "http://arxiv.org/abs/2202.01427", "title": "SparGE: Sparse Coding-based Patient Similarity Learning via Low-rank Constraints and Graph Embedding.", "authors": "Xian Wei, See Kiong Ng, Tongtong Zhang, Yingjie Liu", "abstract": "Patient similarity assessment (PSA) is pivotal to evidence-based and personalized medicine, enabled by analyzing the increasingly available electronic health records (EHRs). However, machine learning approaches for PSA have to deal with inherent data deficiencies of EHRs, namely missing values, noise, and small sample sizes. In this work, an end-to-end discriminative learning framework, called SparGE, is proposed to address these data challenges of EHR for PSA. SparGE measures similarity by joint sparse coding and graph embedding. First, we use low-rank constrained sparse coding to identify and calculate weights for similar patients, while denoising against missing values. Then, graph embedding on sparse representations is adopted to measure the similarity between patient pairs via preserving local relationships defined by distances. Finally, a global cost function is constructed to optimize related parameters.
Experimental results on two private and public real-world healthcare datasets, namely SingHEART and MIMIC-III, show that the proposed SparGE significantly outperforms other machine learning patient similarity methods.", "sentences": ["SparGE: Sparse Coding-based Patient Similarity Learning via Low-rank Constraints and Graph Embedding.", "Patient similarity assessment (PSA) is pivotal to evidence-based and personalized medicine, enabled by analyzing the increasingly available electronic health records (EHRs).", "However, machine learning approaches for PSA have to deal with inherent data deficiencies of EHRs, namely missing values, noise, and small sample sizes.", "In this work, an end-to-end discriminative learning framework, called SparGE, is proposed to address these data challenges of EHR for PSA.", "SparGE measures similarity by joint sparse coding and graph embedding.", "First, we use low-rank constrained sparse coding to identify and calculate weights for similar patients, while denoising against missing values.", "Then, graph embedding on sparse representations is adopted to measure the similarity between patient pairs via preserving local relationships defined by distances.", "Finally, a global cost function is constructed to optimize related parameters.", "Experimental results on two private and public real-world healthcare datasets, namely SingHEART and MIMIC-III, show that the proposed SparGE significantly outperforms other machine learning patient similarity methods."]} {"id": "http://arxiv.org/abs/2202.01440", "title": "Optimized Potential Initialization for Low-latency Spiking Neural Networks.", "authors": "Tong Bu, Jianhao Ding, Zhaofei Yu, Tiejun Huang", "abstract": "Spiking Neural Networks (SNNs) have attracted great attention due to the distinctive properties of low power consumption, biological plausibility, and adversarial robustness. The most effective way to train deep SNNs is through ANN-to-SNN conversion, which has yielded the best performance in deep network structure and large-scale datasets. However, there is a trade-off between accuracy and latency. In order to achieve precision as high as the original ANNs, a long simulation time is needed to match the firing rate of a spiking neuron with the activation value of an analog neuron, which impedes the practical application of SNNs. In this paper, we aim to achieve high-performance converted SNNs with extremely low latency (fewer than 32 time-steps). We start by theoretically analyzing ANN-to-SNN conversion and show that scaling the thresholds does play a similar role as weight normalization. Instead of introducing constraints that facilitate ANN-to-SNN conversion at the cost of model capacity, we applied a more direct way by optimizing the initial membrane potential to reduce the conversion loss in each layer. Besides, we demonstrate that optimal initialization of membrane potentials can implement expected error-free ANN-to-SNN conversion. We evaluate our algorithm on the CIFAR-10, CIFAR-100 and ImageNet datasets and achieve state-of-the-art accuracy, using fewer time-steps. For example, we reach top-1 accuracy of 93.38\\% on CIFAR-10 with 16 time-steps.
Moreover, our method can be applied to other ANN-SNN conversion methodologies and remarkably improve performance when the number of time-steps is small.", "sentences": ["Optimized Potential Initialization for Low-latency Spiking Neural Networks.", "Spiking Neural Networks (SNNs) have attracted great attention due to the distinctive properties of low power consumption, biological plausibility, and adversarial robustness.", "The most effective way to train deep SNNs is through ANN-to-SNN conversion, which has yielded the best performance in deep network structure and large-scale datasets.", "However, there is a trade-off between accuracy and latency.", "In order to achieve precision as high as the original ANNs, a long simulation time is needed to match the firing rate of a spiking neuron with the activation value of an analog neuron, which impedes the practical application of SNNs.", "In this paper, we aim to achieve high-performance converted SNNs with extremely low latency (fewer than 32 time-steps).", "We start by theoretically analyzing ANN-to-SNN conversion and show that scaling the thresholds does play a similar role as weight normalization.", "Instead of introducing constraints that facilitate ANN-to-SNN conversion at the cost of model capacity, we applied a more direct way by optimizing the initial membrane potential to reduce the conversion loss in each layer.", "Besides, we demonstrate that optimal initialization of membrane potentials can implement expected error-free ANN-to-SNN conversion.", "We evaluate our algorithm on the CIFAR-10, CIFAR-100 and ImageNet datasets and achieve state-of-the-art accuracy, using fewer time-steps.", "For example, we reach top-1 accuracy of 93.38\\% on CIFAR-10 with 16 time-steps.", "Moreover, our method can be applied to other ANN-SNN conversion methodologies and remarkably improve performance when the number of time-steps is small."]} {"id": "http://arxiv.org/abs/2202.01448", "title": "Deep Learning Algorithm for Threat Detection in Hackers Forum (Deep Web).", "authors": "Victor Adewopo, Bilal Gonen, Nelly Elsayed, Murat Ozer, Zaghloul Saad Elsayed", "abstract": "In our current society, the inter-connectivity of devices provides easy access for netizens to utilize cyberspace technology for illegal activities. The deep web platform is a consummative ecosystem shielded by boundaries of trust, information sharing, trade-off, and review systems. Domain knowledge is shared among experts in hackers' forums, which contain indicators of compromise that can be explored for cyberthreat intelligence. Developing tools that can be deployed for threat detection is integral in securing digital communication in cyberspace. In this paper, we addressed the use of TOR relay nodes for anonymizing communications in deep web forums. We propose a novel approach for detecting cyberthreats using a deep learning algorithm, Long Short-Term Memory (LSTM). The developed model outperformed the experimental results of other researchers in this problem domain with an accuracy of 94\\% and precision of 90\\%.
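A minimal sketch of the converted-layer simulation this record concerns: an integrate-and-fire neuron whose firing rate over T steps approximates the ANN activation, where the initial membrane potential matters at small T. The half-threshold initialization shown is a common heuristic used here for illustration, not the paper's optimized values:

```python
import numpy as np

def snn_rate(acts, threshold, T=16, v0=None):
    """Integrate-and-fire simulation of one converted layer.

    Each neuron receives a constant input equal to the ANN activation,
    accumulates it in a membrane potential, and spikes (reset by
    subtraction) when the potential crosses `threshold`.  The firing
    rate over T steps approximates the ANN activation; a good initial
    potential v0 (e.g. threshold / 2) reduces the quantisation error
    at small T."""
    v = np.full_like(acts, threshold / 2 if v0 is None else v0)
    spikes = np.zeros_like(acts)
    for _ in range(T):
        v += acts
        fired = v >= threshold
        spikes += fired
        v -= threshold * fired
    return spikes * threshold / T   # firing rate mapped back to activation scale

acts = np.array([0.05, 0.33, 0.71])
print("zero init:          ", snn_rate(acts, 1.0, v0=0.0))
print("half-threshold init:", snn_rate(acts, 1.0))
```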
Our model can be easily deployed by organizations in securing digital communications and detecting vulnerability exposure before a cyberattack.", "sentences": ["Deep Learning Algorithm for Threat Detection in Hackers Forum (Deep Web).", "In our current society, the inter-connectivity of devices provides easy access for netizens to utilize cyberspace technology for illegal activities.", "The deep web platform is a consummative ecosystem shielded by boundaries of trust, information sharing, trade-off, and review systems.", "Domain knowledge is shared among experts in hackers' forums, which contain indicators of compromise that can be explored for cyberthreat intelligence.", "Developing tools that can be deployed for threat detection is integral in securing digital communication in cyberspace.", "In this paper, we addressed the use of TOR relay nodes for anonymizing communications in deep web forums.", "We propose a novel approach for detecting cyberthreats using a deep learning algorithm, Long Short-Term Memory (LSTM).", "The developed model outperformed the experimental results of other researchers in this problem domain with an accuracy of 94\\% and precision of 90\\%.", "Our model can be easily deployed by organizations in securing digital communications and detecting vulnerability exposure before a cyberattack."]} {"id": "http://arxiv.org/abs/2202.01454", "title": "Deep Hierarchy in Bandits.", "authors": "Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, Mohammad Ghavamzadeh", "abstract": "Mean rewards of actions are often correlated. The form of these correlations may be complex and unknown a priori, such as the preferences of a user for recommended products and their categories. To maximize statistical efficiency, it is important to leverage these correlations when learning. We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model with latent variables. Since the hierarchy can have multiple layers, we call it deep. We propose a hierarchical Thompson sampling algorithm (HierTS) for this problem, and show how to implement it efficiently for Gaussian hierarchies. The efficient implementation is possible due to a novel exact hierarchical representation of the posterior, which itself is of independent interest. We use this exact posterior to analyze the Bayes regret of HierTS in Gaussian bandits. Our analysis reflects the structure of the problem, that the regret decreases with the prior width, and also shows that hierarchies reduce the regret by non-constant factors in the number of actions.
We confirm these theoretical findings empirically, in both synthetic and real-world experiments.", "sentences": ["Deep Hierarchy in Bandits.", "Mean rewards of actions are often correlated.", "The form of these correlations may be complex and unknown a priori, such as the preferences of a user for recommended products and their categories.", "To maximize statistical efficiency, it is important to leverage these correlations when learning.", "We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model with latent variables.", "Since the hierarchy can have multiple layers, we call it deep.", "We propose a hierarchical Thompson sampling algorithm (HierTS) for this problem, and show how to implement it efficiently for Gaussian hierarchies.", "The efficient implementation is possible due to a novel exact hierarchical representation of the posterior, which itself is of independent interest.", "We use this exact posterior to analyze the Bayes regret of HierTS in Gaussian bandits.", "Our analysis reflects the structure of the problem, that the regret decreases with the prior width, and also shows that hierarchies reduce the regret by non-constant factors in the number of actions.", "We confirm these theoretical findings empirically, in both synthetic and real-world experiments."]} {"id": "http://arxiv.org/abs/2202.01456", "title": "Fast and explainable clustering based on sorting.", "authors": "Xinye Chen, Stefan G\u00fcttel", "abstract": "We introduce a fast and explainable clustering method called CLASSIX. It consists of two phases, namely a greedy aggregation phase of the sorted data into groups of nearby data points, followed by the merging of groups into clusters. The algorithm is controlled by two scalar parameters, namely a distance parameter for the aggregation and another parameter controlling the minimal cluster size. Extensive experiments are conducted to give a comprehensive evaluation of the clustering performance on synthetic and real-world datasets, with various cluster shapes and low to high feature dimensionality. Our experiments demonstrate that CLASSIX competes with state-of-the-art clustering algorithms. The algorithm has linear space complexity and achieves near linear time complexity on a wide range of problems. 
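For intuition about the HierTS record above, here is a minimal two-level Gaussian sketch of hierarchical Thompson sampling: sample the latent variable from its posterior, then sample each action mean given it, and act greedily. The hyperparameters and posterior bookkeeping are simplified assumptions, not the paper's exact representation:

```python
import numpy as np

rng = np.random.default_rng(2)

def hier_ts_round(means, counts, mu0, tau0, sigma, tau):
    """One round of hierarchical Thompson sampling (Gaussian case).

    Hierarchy: theta ~ N(mu0, tau0^2), action means mu_a ~ N(theta, tau^2),
    rewards ~ N(mu_a, sigma^2).  `means` are per-action empirical means
    (0 where unexplored) and `counts` the pull counts."""
    # Posterior of the latent theta, integrating out each action mean.
    prec0 = 1.0 / tau0**2
    prec_a = counts / (sigma**2 + counts * tau**2)
    post_prec = prec0 + prec_a.sum()
    post_mean = (prec0 * mu0 + (prec_a * means).sum()) / post_prec
    theta = rng.normal(post_mean, post_prec**-0.5)
    # Sample each action mean given theta, then play the argmax.
    prec = counts / sigma**2 + 1.0 / tau**2
    mu = rng.normal((counts * means / sigma**2 + theta / tau**2) / prec, prec**-0.5)
    return int(np.argmax(mu))

means = np.array([0.4, 0.0, 0.6])
counts = np.array([5.0, 0.0, 3.0])
print(hier_ts_round(means, counts, mu0=0.0, tau0=1.0, sigma=1.0, tau=0.5))
```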
Its inherent simplicity allows for the generation of intuitive explanations of the computed clusters.", "sentences": ["Fast and explainable clustering based on sorting.", "We introduce a fast and explainable clustering method called CLASSIX.", "It consists of two phases, namely a greedy aggregation phase of the sorted data into groups of nearby data points, followed by the merging of groups into clusters.", "The algorithm is controlled by two scalar parameters, namely a distance parameter for the aggregation and another parameter controlling the minimal cluster size.", "Extensive experiments are conducted to give a comprehensive evaluation of the clustering performance on synthetic and real-world datasets, with various cluster shapes and low to high feature dimensionality.", "Our experiments demonstrate that CLASSIX competes with state-of-the-art clustering algorithms.", "The algorithm has linear space complexity and achieves near linear time complexity on a wide range of problems.", "Its inherent simplicity allows for the generation of intuitive explanations of the computed clusters."]} {"id": "http://arxiv.org/abs/2202.01459", "title": "Concept Bottleneck Model with Additional Unsupervised Concepts.", "authors": "Yoshihide Sawada, Keigo Nakamura", "abstract": "With the increasing demands for accountability, interpretability is becoming an essential capability for real-world AI applications. However, most methods utilize post-hoc approaches rather than training the interpretable model. In this article, we propose a novel interpretable model based on the concept bottleneck model (CBM). CBM uses concept labels to train an intermediate layer as the additional visible layer. However, because the number of concept labels restricts the dimension of this layer, it is difficult to obtain high accuracy with a small number of labels. To address this issue, we integrate supervised concepts with unsupervised ones trained with self-explaining neural networks (SENNs). By seamlessly training these two types of concepts while reducing the amount of computation, we can obtain both supervised and unsupervised concepts simultaneously, even for large-sized images. We refer to the proposed model as the concept bottleneck model with additional unsupervised concepts (CBM-AUC). We experimentally confirmed that the proposed model outperformed CBM and SENN. 
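A compact sketch of the sorting-based aggregation phase described in the CLASSIX record above: points are sorted by their first principal-component score, and a single sweep greedily starts a new group whenever a point is unassigned, using sorted scores for early termination. The toy data are illustrative, and the merging phase and minimum-cluster-size parameter are omitted:

```python
import numpy as np

def greedy_aggregate(X, radius):
    """CLASSIX-style aggregation sketch: sort points by their first
    principal-component score, then sweep once; each unassigned point
    starts a group and absorbs later unassigned points within `radius`.
    Since |score_j - score_i| <= ||x_j - x_i||, the inner scan can stop
    as soon as the score gap alone exceeds the radius."""
    center = X - X.mean(0)
    pc = np.linalg.svd(center, full_matrices=False)[2][0]
    score = center @ pc
    order = np.argsort(score)
    labels = np.full(len(X), -1)
    g = 0
    for a, i in enumerate(order):
        if labels[i] != -1:
            continue
        labels[i] = g
        for j in order[a + 1:]:
            if score[j] - score[i] > radius:   # early exit thanks to sorting
                break
            if labels[j] == -1 and np.linalg.norm(X[j] - X[i]) <= radius:
                labels[j] = g
        g += 1
    return labels

X = np.array([[0, 0], [0.1, 0], [5, 5], [5.2, 5.1], [9, 0]], float)
print(greedy_aggregate(X, radius=0.5))   # three groups
```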
We also visualized the saliency map of each concept and confirmed that it was consistent with the semantic meanings.", "sentences": ["Concept Bottleneck Model with Additional Unsupervised Concepts.", "With the increasing demands for accountability, interpretability is becoming an essential capability for real-world AI applications.", "However, most methods utilize post-hoc approaches rather than training the interpretable model.", "In this article, we propose a novel interpretable model based on the concept bottleneck model (CBM).", "CBM uses concept labels to train an intermediate layer as the additional visible layer.", "However, because the number of concept labels restricts the dimension of this layer, it is difficult to obtain high accuracy with a small number of labels.", "To address this issue, we integrate supervised concepts with unsupervised ones trained with self-explaining neural networks (SENNs).", "By seamlessly training these two types of concepts while reducing the amount of computation, we can obtain both supervised and unsupervised concepts simultaneously, even for large-sized images.", "We refer to the proposed model as the concept bottleneck model with additional unsupervised concepts (CBM-AUC).", "We experimentally confirmed that the proposed model outperformed CBM and SENN.", "We also visualized the saliency map of each concept and confirmed that it was consistent with the semantic meanings."]} {"id": "http://arxiv.org/abs/2202.01461", "title": "ExPoSe: Combining State-Based Exploration with Gradient-Based Online Search.", "authors": "Dixant Mittal, Siddharth Arvindan, Wee Sun Lee", "abstract": "A tree-based online search algorithm iteratively simulates trajectories and updates Q-value information on a set of states represented by a tree structure. Alternatively, policy gradient based online search algorithms update the information obtained from simulated trajectories directly onto the parameters of the policy and have been found to be effective. While tree-based methods limit the updates from simulations to the states that exist in the tree and do not interpolate the information to nearby states, policy gradient search methods do not do explicit exploration. In this paper, we show that it is possible to combine and leverage the strengths of these two methods for improved search performance. We examine the key reasons behind the improvement and propose a simple yet effective online search method, named Exploratory Policy Gradient Search (ExPoSe), that updates both the parameters of the policy as well as search information on the states in the trajectory.
We conduct experiments on complex planning problems, which include Sokoban and Hamiltonian cycle search in sparse graphs and show that combining exploration with policy gradient improves online search performance.", "sentences": ["ExPoSe: Combining State-Based Exploration with Gradient-Based Online Search.", "A tree-based online search algorithm iteratively simulates trajectories and updates Q-value information on a set of states represented by a tree structure.", "Alternatively, policy gradient based online search algorithms update the information obtained from simulated trajectories directly onto the parameters of the policy and have been found to be effective.", "While tree-based methods limit the updates from simulations to the states that exist in the tree and do not interpolate the information to nearby states, policy gradient search methods do not do explicit exploration.", "In this paper, we show that it is possible to combine and leverage the strengths of these two methods for improved search performance.", "We examine the key reasons behind the improvement and propose a simple yet effective online search method, named Exploratory Policy Gradient Search (ExPoSe), that updates both the parameters of the policy as well as search information on the states in the trajectory.", "We conduct experiments on complex planning problems, which include Sokoban and Hamiltonian cycle search in sparse graphs and show that combining exploration with policy gradient improves online search performance."]} {"id": "http://arxiv.org/abs/2202.01463", "title": "Minimax rate of consistency for linear models with missing values.", "authors": "Alexis Ayme (LPSM (UMR\\_8001)), Claire Boyer (LPSM (UMR\\_8001), MOKAPLAN), Aymeric Dieuleveut (CMAP), Erwan Scornet (CMAP)", "abstract": "Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...). In fact, the very nature of missing values usually prevents us from running standard learning algorithms. In this paper, we focus on the extensively-studied linear models, but in the presence of missing values, which turns out to be quite a challenging task. Indeed, the Bayes rule can be decomposed as a sum of predictors corresponding to each missing pattern. This eventually requires solving a number of learning tasks, exponential in the number of input features, which makes predictions impossible for current real-world datasets. First, we propose a rigorous setting to analyze a least-square type estimator and establish a bound on the excess risk which increases exponentially in the dimension. Consequently, we leverage the missing data distribution to propose a new algorithm, and derive associated adaptive risk bounds that turn out to be minimax optimal.
Numerical experiments highlight the benefits of our method compared to state-of-the-art algorithms used for predictions with missing values.", "sentences": ["Minimax rate of consistency for linear models with missing values.", "Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...).", "In fact, the very nature of missing values usually prevents us from running standard learning algorithms.", "In this paper, we focus on the extensively-studied linear models, but in the presence of missing values, which turns out to be quite a challenging task.", "Indeed, the Bayes rule can be decomposed as a sum of predictors corresponding to each missing pattern.", "This eventually requires solving a number of learning tasks, exponential in the number of input features, which makes predictions impossible for current real-world datasets.", "First, we propose a rigorous setting to analyze a least-square type estimator and establish a bound on the excess risk which increases exponentially in the dimension.", "Consequently, we leverage the missing data distribution to propose a new algorithm, and derive associated adaptive risk bounds that turn out to be minimax optimal.", "Numerical experiments highlight the benefits of our method compared to state-of-the-art algorithms used for predictions with missing values."]} {"id": "http://arxiv.org/abs/2202.01468", "title": "A unified surrogate-based scheme for black-box and preference-based optimization.", "authors": "Davide Previtali, Mirko Mazzoleni, Antonio Ferramosca, Fabio Previdi", "abstract": "Black-box and preference-based optimization algorithms are global optimization procedures that aim to find the global solutions of an optimization problem using, respectively, as few function evaluations or sample comparisons as possible. In the black-box case, the analytical expression of the objective function is unknown and it can only be evaluated through a (costly) computer simulation or an experiment. In the preference-based case, the objective function is still unknown but it corresponds to the subjective criterion of an individual. So, it is not possible to quantify such a criterion in a reliable and consistent way. Therefore, preference-based optimization algorithms seek global solutions using only comparisons between pairs of different samples, for which a human decision-maker indicates which of the two is preferred. Quite often, the black-box and preference-based frameworks are covered separately and are handled using different techniques. In this paper, we show that black-box and preference-based optimization problems are closely related and can be solved using the same family of approaches, namely surrogate-based methods. Moreover, we propose the generalized Metric Response Surface (gMRS) algorithm, an optimization scheme that is a generalization of the popular MSRS framework.
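The pattern-wise decomposition of the Bayes predictor can be mimicked directly: fit one least-squares model per observed-coordinate pattern. The sketch below is a naive illustration of that decomposition (and of why it scales as 2^d), not the adaptive algorithm the record proposes:

```python
import numpy as np

rng = np.random.default_rng(3)

def patternwise_lstsq(X, y):
    """Fit one least-squares predictor per missingness pattern.

    Rows are grouped by which entries are observed, and each group gets
    its own regression on the observed coordinates only.  The number of
    patterns can grow as 2^d, which is the blow-up the paper addresses."""
    models = {}
    patterns = ~np.isnan(X)
    for p in {tuple(row) for row in patterns}:
        rows = (patterns == p).all(1)
        obs = np.array(p)
        A = np.c_[np.ones(rows.sum()), X[rows][:, obs]]
        models[p] = np.linalg.lstsq(A, y[rows], rcond=None)[0]
    return models

def predict(models, x):
    obs = ~np.isnan(x)
    w = models[tuple(obs)]
    return w[0] + x[obs] @ w[1:]

X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=200)
X[rng.random((200, 2)) < 0.3] = np.nan          # inject missing values
models = patternwise_lstsq(X, y)
print(predict(models, np.array([0.5, np.nan])))
```

A pattern absent from training simply has no model here, which is one facet of the exponential-blowup issue the record's adaptive algorithm is designed to avoid.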
Finally, we provide a convergence proof for the proposed optimization method.", "sentences": ["A unified surrogate-based scheme for black-box and preference-based optimization.", "Black-box and preference-based optimization algorithms are global optimization procedures that aim to find the global solutions of an optimization problem using, respectively, as few function evaluations or sample comparisons as possible.", "In the black-box case, the analytical expression of the objective function is unknown and it can only be evaluated through a (costly) computer simulation or an experiment.", "In the preference-based case, the objective function is still unknown but it corresponds to the subjective criterion of an individual.", "So, it is not possible to quantify such a criterion in a reliable and consistent way.", "Therefore, preference-based optimization algorithms seek global solutions using only comparisons between pairs of different samples, for which a human decision-maker indicates which of the two is preferred.", "Quite often, the black-box and preference-based frameworks are covered separately and are handled using different techniques.", "In this paper, we show that black-box and preference-based optimization problems are closely related and can be solved using the same family of approaches, namely surrogate-based methods.", "Moreover, we propose the generalized Metric Response Surface (gMRS) algorithm, an optimization scheme that is a generalization of the popular MSRS framework.", "Finally, we provide a convergence proof for the proposed optimization method."]} {"id": "http://arxiv.org/abs/2202.01479", "title": "MRI Reconstruction via Data Driven Markov Chain with Joint Uncertainty Estimation.", "authors": "Guanxiong Luo, Martin Heide, Martin Uecker", "abstract": "We introduce a framework that enables efficient sampling from learned probability distributions for MRI reconstruction. Different from conventional deep learning-based MRI reconstruction techniques, samples are drawn from the posterior distribution given the measured k-space using the Markov chain Monte Carlo (MCMC) method. In addition to the maximum a posteriori (MAP) estimate for the image, which can be obtained with conventional methods, the minimum mean square error (MMSE) estimate and uncertainty maps can also be computed. The data-driven Markov chains are constructed from the generative model learned from a given image database and are independent of the forward operator that is used to model the k-space measurement. This provides flexibility because the method can be applied to k-space acquired with different sampling schemes or receive coils using the same pre-trained models. Furthermore, we use a framework based on a reverse diffusion process to be able to utilize advanced generative models.
The performance of the method is evaluated on an open dataset using 10-fold accelerated acquisition.", "sentences": ["MRI Reconstruction via Data Driven Markov Chain with Joint Uncertainty Estimation.", "We introduce a framework that enables efficient sampling from learned probability distributions for MRI reconstruction.", "Different from conventional deep learning-based MRI reconstruction techniques, samples are drawn from the posterior distribution given the measured k-space using the Markov chain Monte Carlo (MCMC) method.", "In addition to the maximum a posteriori (MAP) estimate for the image, which can be obtained with conventional methods, the minimum mean square error (MMSE) estimate and uncertainty maps can also be computed.", "The data-driven Markov chains are constructed from the generative model learned from a given image database and are independent of the forward operator that is used to model the k-space measurement.", "This provides flexibility because the method can be applied to k-space acquired with different sampling schemes or receive coils using the same pre-trained models.", "Furthermore, we use a framework based on a reverse diffusion process to be able to utilize advanced generative models.", "The performance of the method is evaluated on an open dataset using 10-fold accelerated acquisition."]} {"id": "http://arxiv.org/abs/2202.01487", "title": "A benchmark of state-of-the-art sound event detection systems evaluated on synthetic soundscapes.", "authors": "Francesca Ronchini, Romain Serizel", "abstract": "This paper proposes a benchmark of submissions to the Detection and Classification of Acoustic Scenes and Events 2021 Challenge (DCASE) Task 4, representing a sampling of the state-of-the-art in the Sound Event Detection task. The submissions are evaluated according to the two polyphonic sound detection score scenarios proposed for the DCASE 2021 Challenge Task 4, which allow an analysis of whether submissions are designed to perform fine-grained temporal segmentation, coarse-grained temporal segmentation, or have been designed to be polyvalent on the scenarios proposed. We study the solutions proposed by participants to analyze their robustness to varying target to non-target signal-to-noise ratios and to the temporal localization of target sound events. A final experiment is proposed to study the impact of non-target events on system outputs. Results show that systems adapted to provide coarse segmentation outputs are more robust to different target to non-target signal-to-noise ratios and, with the help of specific data augmentation methods, they are more robust to the time localization of the original event. Results of the final experiment display that systems tend to spuriously predict short events when non-target events are present.
This is particularly true for systems that are tailored to have a fine segmentation.", "sentences": ["A benchmark of state-of-the-art sound event detection systems evaluated on synthetic soundscapes.", "This paper proposes a benchmark of submissions to the Detection and Classification of Acoustic Scenes and Events 2021 Challenge (DCASE) Task 4, representing a sampling of the state-of-the-art in the Sound Event Detection task.", "The submissions are evaluated according to the two polyphonic sound detection score scenarios proposed for the DCASE 2021 Challenge Task 4, which allow an analysis of whether submissions are designed to perform fine-grained temporal segmentation, coarse-grained temporal segmentation, or have been designed to be polyvalent on the scenarios proposed.", "We study the solutions proposed by participants to analyze their robustness to varying target to non-target signal-to-noise ratios and to the temporal localization of target sound events.", "A final experiment is proposed to study the impact of non-target events on system outputs.", "Results show that systems adapted to provide coarse segmentation outputs are more robust to different target to non-target signal-to-noise ratios and, with the help of specific data augmentation methods, they are more robust to the time localization of the original event.", "Results of the final experiment display that systems tend to spuriously predict short events when non-target events are present.", "This is particularly true for systems that are tailored to have a fine segmentation."]} {"id": "http://arxiv.org/abs/2202.01511", "title": "Challenging Common Assumptions in Convex Reinforcement Learning.", "authors": "Mirco Mutti, Riccardo De Santi, Piersilvio De Bartolomeis, Marcello Restelli", "abstract": "The classic Reinforcement Learning (RL) formulation concerns the maximization of a scalar reward function. More recently, convex RL has been introduced to extend the RL formulation to all the objectives that are convex functions of the state distribution induced by a policy. Notably, convex RL covers several relevant applications that do not fall into the scalar formulation, including imitation learning, risk-averse RL, and pure exploration. In classic RL, it is common to optimize an infinite trials objective, which accounts for the state distribution instead of the empirical state visitation frequencies, even though the actual number of trajectories is always finite in practice. This is theoretically sound since the infinite trials and finite trials objectives can be proved to coincide and thus lead to the same optimal policy. In this paper, we show that this hidden assumption does not hold in the convex RL setting. In particular, we show that erroneously optimizing the infinite trials objective in place of the actual finite trials one, as it is usually done, can lead to a significant approximation error.
Since the finite trials setting is the default in both simulated and real-world RL, we believe shedding light on this issue will lead to better approaches and methodologies for convex RL, impacting relevant research areas such as imitation learning, risk-averse RL, and pure exploration among others.", "sentences": ["Challenging Common Assumptions in Convex Reinforcement Learning.", "The classic Reinforcement Learning (RL) formulation concerns the maximization of a scalar reward function.", "More recently, convex RL has been introduced to extend the RL formulation to all the objectives that are convex functions of the state distribution induced by a policy.", "Notably, convex RL covers several relevant applications that do not fall into the scalar formulation, including imitation learning, risk-averse RL, and pure exploration.", "In classic RL, it is common to optimize an infinite trials objective, which accounts for the state distribution instead of the empirical state visitation frequencies, even though the actual number of trajectories is always finite in practice.", "This is theoretically sound since the infinite trials and finite trials objectives can be proved to coincide and thus lead to the same optimal policy.", "In this paper, we show that this hidden assumption does not hold in the convex RL setting.", "In particular, we show that erroneously optimizing the infinite trials objective in place of the actual finite trials one, as it is usually done, can lead to a significant approximation error.", "Since the finite trials setting is the default in both simulated and real-world RL, we believe shedding light on this issue will lead to better approaches and methodologies for convex RL, impacting relevant research areas such as imitation learning, risk-averse RL, and pure exploration among others."]} {"id": "http://arxiv.org/abs/2202.01512", "title": "Data Heterogeneity-Robust Federated Learning via Group Client Selection in Industrial IoT.", "authors": "Zonghang Li, Yihong He, Hongfang Yu, Jiawen Kang, Xiaoping Li, Zenglin Xu, Dusit Niyato", "abstract": "Nowadays, the industrial Internet of Things (IIoT) has played an integral role in Industry 4.0 and produced massive amounts of data for industrial intelligence. These data reside on decentralized devices in modern factories. To protect the confidentiality of industrial data, federated learning (FL) was introduced to collaboratively train shared machine learning models. However, the local data collected by different devices skew in class distribution and degrade industrial FL performance. This challenge has been widely studied at the mobile edge, but existing studies ignored the rapidly changing streaming data and the clustering nature of factory devices, and, more seriously, may threaten data security. In this paper, we propose FedGS, which is a hierarchical cloud-edge-end FL framework for 5G empowered industries, to improve industrial FL performance on non-i.i.d. data. Taking advantage of naturally clustered factory devices, FedGS uses a gradient-based binary permutation algorithm (GBP-CS) to select a subset of devices within each factory and build homogeneous super nodes participating in FL training. Then, we propose a compound-step synchronization protocol to coordinate the training process within and among these super nodes, which shows great robustness against data heterogeneity. The proposed methods are time-efficient and can adapt to dynamic environments, without exposing confidential industrial data to risky manipulation.
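The finite- vs infinite-trials gap in the convex RL record above is easy to reproduce numerically: for a concave objective such as the entropy of the state distribution, averaging the objective over per-episode empirical distributions differs from evaluating it on the aggregate distribution, by Jensen's inequality. The toy numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def entropy(d):
    d = d[d > 0]
    return float(-(d * np.log(d)).sum())

# Toy rollouts: empirical state-visitation distributions of n episodes.
n_states, n_episodes, horizon = 5, 100, 20
true_dist = rng.dirichlet(np.ones(n_states))
episodes = rng.multinomial(horizon, true_dist, size=n_episodes) / horizon

finite_trials = np.mean([entropy(d) for d in episodes])   # E[F(d_hat)]
infinite_trials = entropy(episodes.mean(0))               # F(E[d_hat])

# For a concave F such as entropy, Jensen's inequality makes the
# infinite-trials value an over-estimate of the finite-trials one.
print(f"finite-trials objective:   {finite_trials:.3f}")
print(f"infinite-trials objective: {infinite_trials:.3f}")
```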
We prove that FedGS has better convergence performance than FedAvg and give a relaxed condition under which FedGS is more communication-efficient. Extensive experiments show that FedGS improves accuracy by 3.5% and reduces training rounds by 59% on average, confirming its superior effectiveness and efficiency on non-i.i.d. data.", "sentences": ["Data Heterogeneity-Robust Federated Learning via Group Client Selection in Industrial IoT.", "Nowadays, the industrial Internet of Things (IIoT) has played an integral role in Industry 4.0 and produced massive amounts of data for industrial intelligence.", "These data reside on decentralized devices in modern factories.", "To protect the confidentiality of industrial data, federated learning (FL) was introduced to collaboratively train shared machine learning models.", "However, the local data collected by different devices skew in class distribution and degrade industrial FL performance.", "This challenge has been widely studied at the mobile edge, but existing studies ignored the rapidly changing streaming data and the clustering nature of factory devices, and, more seriously, may threaten data security.", "In this paper, we propose FedGS, which is a hierarchical cloud-edge-end FL framework for 5G empowered industries, to improve industrial FL performance on non-i.i.d. data.", "Taking advantage of naturally clustered factory devices, FedGS uses a gradient-based binary permutation algorithm (GBP-CS) to select a subset of devices within each factory and build homogeneous super nodes participating in FL training.", "Then, we propose a compound-step synchronization protocol to coordinate the training process within and among these super nodes, which shows great robustness against data heterogeneity.", "The proposed methods are time-efficient and can adapt to dynamic environments, without exposing confidential industrial data to risky manipulation.", "We prove that FedGS has better convergence performance than FedAvg and give a relaxed condition under which FedGS is more communication-efficient.", "Extensive experiments show that FedGS improves accuracy by 3.5% and reduces training rounds by 59% on average, confirming its superior effectiveness and efficiency on non-i.i.d. data."]} {"id": "http://arxiv.org/abs/2202.01529", "title": "Comparative assessment of federated and centralized machine learning.", "authors": "Ibrahim Abdul Majeed, Sagar Kaushik, Aniruddha Bardhan, Venkata Siva Kumar Tadi, Hwang-Ki Min, Karthikeyan Kumaraguru, Rajasekhara Duvvuru Muni", "abstract": "Federated Learning (FL) is a privacy preserving machine learning scheme, where training happens with data federated across devices, without the data leaving them, to sustain user privacy. This is ensured by sending the untrained or partially trained models directly to the individual devices, training them locally \"on-device\" using the device-owned data, and having the server aggregate all the partially trained model learnings to update a global model. Although almost all the model learning schemes in the federated learning setup use gradient descent, there are certain characteristic differences brought about by the non-IID nature of the data availability, that affect the training in comparison to the centralized schemes.
In this paper, we discuss the various factors that affect federated learning training, both because of the non-IID distributed nature of the data and because of the inherent differences between the federated learning approach and typical centralized gradient descent techniques. We empirically demonstrate the effect of the number of samples per device and of the distribution of output labels on federated learning. In addition to the privacy advantage we seek through federated learning, we also study whether there is a cost advantage to using federated learning frameworks. We show that federated learning does have a cost advantage when the models to be trained are not very large. All in all, we demonstrate the need for careful model design with respect to both performance and cost.", "sentences": ["Comparative assessment of federated and centralized machine learning.", "Federated Learning (FL) is a privacy-preserving machine learning scheme in which training happens on data federated across devices, without the data ever leaving them, to sustain user privacy.", "This is ensured by sending the untrained or partially trained models directly to the individual devices, training them locally \"on-device\" using the device-owned data, and having the server aggregate all the partially trained model updates into a global model.", "Although almost all model learning schemes in the federated learning setup use gradient descent, the non-IID nature of the data availability brings about certain characteristic differences that affect the training in comparison to the centralized schemes.", "In this paper, we discuss the various factors that affect federated learning training, both because of the non-IID distributed nature of the data and because of the inherent differences between the federated learning approach and typical centralized gradient descent techniques.", "We empirically demonstrate the effect of the number of samples per device and of the distribution of output labels on federated learning.", "In addition to the privacy advantage we seek through federated learning, we also study whether there is a cost advantage to using federated learning frameworks.", "We show that federated learning does have a cost advantage when the models to be trained are not very large.", "All in all, we demonstrate the need for careful model design with respect to both performance and cost."]} {"id": "http://arxiv.org/abs/2202.01534", "title": "Influence-Augmented Local Simulators: A Scalable Solution for Fast Deep RL in Large Networked Systems.", "authors": "Miguel Suau, Jinke He, Matthijs T. J. Spaan, Frans A. Oliehoek", "abstract": "Learning effective policies for real-world problems is still an open challenge for the field of reinforcement learning (RL). The main limitations are the amount of data needed and the pace at which that data can be obtained. In this paper, we study how to build lightweight simulators of complicated systems that can run sufficiently fast for deep RL to be applicable. We focus on domains where agents interact with a reduced portion of a larger environment while still being affected by the global dynamics. Our method combines the use of local simulators with learned models that mimic the influence of the global system.
The experiments reveal that incorporating this idea into the deep RL workflow can considerably accelerate the training process and open up several opportunities for the future.", "sentences": ["Influence-Augmented Local Simulators: A Scalable Solution for Fast Deep RL in Large Networked Systems.", "Learning effective policies for real-world problems is still an open challenge for the field of reinforcement learning (RL).", "The main limitations are the amount of data needed and the pace at which that data can be obtained.", "In this paper, we study how to build lightweight simulators of complicated systems that can run sufficiently fast for deep RL to be applicable.", "We focus on domains where agents interact with a reduced portion of a larger environment while still being affected by the global dynamics.", "Our method combines the use of local simulators with learned models that mimic the influence of the global system.", "The experiments reveal that incorporating this idea into the deep RL workflow can considerably accelerate the training process and open up several opportunities for the future."]} {"id": "http://arxiv.org/abs/2202.01545", "title": "Byzantine-Robust Decentralized Learning via Self-Centered Clipping.", "authors": "Lie He, Sai Praneeth Karimireddy, Martin Jaggi", "abstract": "In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs. Unlike federated learning where workers communicate through a server, workers in the decentralized environment can only talk to their neighbors, making it harder to reach consensus. We identify a novel dissensus attack in which a few malicious nodes can take advantage of information bottlenecks in the topology to poison the collaboration. To address these issues, we propose a Self-Centered Clipping (SCClip) algorithm for Byzantine-robust consensus and optimization, which is the first to provably converge to a $O(\\delta_{\\max}\\zeta^2/\\gamma^2)$ neighborhood of the stationary point for non-convex objectives under standard assumptions.
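A minimal sketch of the self-centered clipping idea from the SCClip entry above, under the assumption (not stated in the abstract) that each worker clips the differences to its neighbors' iterates around its own iterate before averaging; tau and the toy vectors below are placeholders.

```python
import numpy as np

def clip(v, tau):
    """Scale v down to norm tau if it is longer than tau."""
    n = np.linalg.norm(v)
    return v if n <= tau else v * (tau / n)

def self_centered_clip_step(x_i, neighbor_xs, weights, tau):
    """One gossip step in which worker i clips each neighbor's difference
    around its *own* iterate before averaging, so a Byzantine neighbor
    can pull the result by at most tau per unit of mixing weight."""
    step = sum(w * clip(x_j - x_i, tau)
               for w, x_j in zip(weights, neighbor_xs))
    return x_i + step

x_i = np.zeros(3)
neighbors = [np.ones(3), np.array([100.0, -100.0, 100.0])]  # 2nd is adversarial
print(self_centered_clip_step(x_i, neighbors, weights=[0.5, 0.5], tau=2.0))
```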
Finally, we demonstrate the encouraging empirical performance of SCClip under a large number of attacks.", "sentences": ["Byzantine-Robust Decentralized Learning via Self-Centered Clipping.", "In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs.", "Unlike federated learning where workers communicate through a server, workers in the decentralized environment can only talk to their neighbors, making it harder to reach consensus.", "We identify a novel dissensus attack in which a few malicious nodes can take advantage of information bottlenecks in the topology to poison the collaboration.", "To address these issues, we propose a Self-Centered Clipping (SCClip) algorithm for Byzantine-robust consensus and optimization, which is the first to provably converge to a $O(\\delta_{\\max}\\zeta^2/\\gamma^2)$ neighborhood of the stationary point for non-convex objectives under standard assumptions.", "Finally, we demonstrate the encouraging empirical performance of SCClip under a large number of attacks."]} {"id": "http://arxiv.org/abs/2202.01562", "title": "Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model.", "authors": "Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto", "abstract": "In real-world recommender systems and search engines, optimizing ranking decisions to present a ranked list of relevant items is critical. Off-policy evaluation (OPE) for ranking policies is thus gaining a growing interest because it enables performance estimation of new ranking policies using only logged data. Although OPE in contextual bandits has been studied extensively, its naive application to the ranking setting faces a critical variance issue due to the huge item space. To tackle this problem, previous studies introduce some assumptions on user behavior to make the combinatorial item space tractable. However, an unrealistic assumption may, in turn, cause serious bias. Therefore, appropriately controlling the bias-variance tradeoff by imposing a reasonable assumption is the key for success in OPE of ranking policies. To achieve a well-balanced bias-variance tradeoff, we propose the Cascade Doubly Robust estimator building on the cascade assumption, which assumes that a user interacts with items sequentially from the top position in a ranking. We show that the proposed estimator is unbiased in more cases compared to existing estimators that make stronger assumptions. Furthermore, compared to a previous estimator based on the same cascade assumption, the proposed estimator reduces the variance by leveraging a control variate.
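Under the cascade assumption above, importance weights at position k depend only on the top-k prefix. Below is a deliberately simplified sketch of a cascade-style doubly robust estimate for a single logged ranking, assuming per-position rewards and a given reward-model baseline q_hat; the paper's estimator additionally takes an expectation of the baseline under the evaluation policy, which is omitted here.

```python
import numpy as np

def cascade_dr(pi_e, pi_b, rewards, q_hat):
    """Simplified cascade doubly robust estimate for one logged ranking
    (illustrative only; see the lead-in for what is omitted).

    pi_e, pi_b: (K,) evaluation/behavior probabilities of the logged item
                at each position; rewards, q_hat: (K,) per-position values.
    """
    w = np.cumprod(pi_e / pi_b)               # cascade importance weights w_{1:k}
    w_prev = np.concatenate(([1.0], w[:-1]))  # w_{1:k-1}, with w_{1:0} = 1
    # DR form: shifted baseline plus importance-weighted residual per position
    return float(np.sum(w * (rewards - q_hat) + w_prev * q_hat))

print(cascade_dr(np.array([0.5, 0.4]), np.array([0.25, 0.5]),
                 rewards=np.array([1.0, 0.0]), q_hat=np.array([0.6, 0.2])))
```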
Comprehensive experiments on both synthetic and real-world data demonstrate that our estimator leads to more accurate OPE than existing estimators in a variety of settings.", "sentences": ["Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model.", "In real-world recommender systems and search engines, optimizing ranking decisions to present a ranked list of relevant items is critical.", "Off-policy evaluation (OPE) for ranking policies is thus gaining a growing interest because it enables performance estimation of new ranking policies using only logged data.", "Although OPE in contextual bandits has been studied extensively, its naive application to the ranking setting faces a critical variance issue due to the huge item space.", "To tackle this problem, previous studies introduce some assumptions on user behavior to make the combinatorial item space tractable.", "However, an unrealistic assumption may, in turn, cause serious bias.", "Therefore, appropriately controlling the bias-variance tradeoff by imposing a reasonable assumption is the key for success in OPE of ranking policies.", "To achieve a well-balanced bias-variance tradeoff, we propose the Cascade Doubly Robust estimator building on the cascade assumption, which assumes that a user interacts with items sequentially from the top position in a ranking.", "We show that the proposed estimator is unbiased in more cases compared to existing estimators that make stronger assumptions.", "Furthermore, compared to a previous estimator based on the same cascade assumption, the proposed estimator reduces the variance by leveraging a control variate.", "Comprehensive experiments on both synthetic and real-world data demonstrate that our estimator leads to more accurate OPE than existing estimators in a variety of settings."]} {"id": "http://arxiv.org/abs/2202.01566", "title": "Unified theory of atom-centered representations and graph convolutional machine-learning schemes.", "authors": "Jigyasa Nigam, Guillaume Fraux, Michele Ceriotti", "abstract": "Data-driven schemes that associate molecular and crystal structures with their microscopic properties share the need for a concise, effective description of the arrangement of their atomic constituents. Many types of models rely on descriptions of atom-centered environments, that are associated with an atomic property or with an atomic contribution to an extensive macroscopic quantity. Frameworks in this class can be understood in terms of atom-centered density correlations (ACDC), that are used as a basis for a body-ordered, symmetry-adapted expansion of the targets. Several other schemes, that gather information on the relationship between neighboring atoms using graph-convolutional (or message-passing) ideas, cannot be directly mapped to correlations centered around a single atom. 
We generalize the ACDC framework to include multi-centered information, generating representations that provide a complete linear basis to regress symmetric functions of atomic coordinates, and form the basis to systematize our understanding of both atom-centered and graph-convolutional machine-learning schemes.", "sentences": ["Unified theory of atom-centered representations and graph convolutional machine-learning schemes.", "Data-driven schemes that associate molecular and crystal structures with their microscopic properties share the need for a concise, effective description of the arrangement of their atomic constituents.", "Many types of models rely on descriptions of atom-centered environments, that are associated with an atomic property or with an atomic contribution to an extensive macroscopic quantity.", "Frameworks in this class can be understood in terms of atom-centered density correlations (ACDC), that are used as a basis for a body-ordered, symmetry-adapted expansion of the targets.", "Several other schemes, that gather information on the relationship between neighboring atoms using graph-convolutional (or message-passing) ideas, cannot be directly mapped to correlations centered around a single atom.", "We generalize the ACDC framework to include multi-centered information, generating representations that provide a complete linear basis to regress symmetric functions of atomic coordinates, and form the basis to systematize our understanding of both atom-centered and graph-convolutional machine-learning schemes."]} {"id": "http://arxiv.org/abs/2202.01571", "title": "Toric Geometry of Entropic Regularization.", "authors": "Bernd Sturmfels, Simon Telen, Fran\u00e7ois-Xavier Vialard, Max von Renesse", "abstract": "Entropic regularization is a method for large-scale linear programming. Geometrically, one traces intersections of the feasible polytope with scaled toric varieties, starting at the Birch point. We compare this to log-barrier methods, with reciprocal linear spaces, starting at the analytic center. We revisit entropic regularization for unbalanced optimal transport, and we develop the use of optimal conic couplings. We compute the degree of the associated toric variety, and we explore algorithms like iterative scaling.", "sentences": ["Toric Geometry of Entropic Regularization.", "Entropic regularization is a method for large-scale linear programming.", "Geometrically, one traces intersections of the feasible polytope with scaled toric varieties, starting at the Birch point.", "We compare this to log-barrier methods, with reciprocal linear spaces, starting at the analytic center.", "We revisit entropic regularization for unbalanced optimal transport, and we develop the use of optimal conic couplings.", "We compute the degree of the associated toric variety, and we explore algorithms like iterative scaling."]} {"id": "http://arxiv.org/abs/2202.01575", "title": "CoST: Contrastive Learning of Disentangled Seasonal-Trend Representations for Time Series Forecasting.", "authors": "Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, Steven Hoi", "abstract": "Deep learning has been actively studied for time series forecasting, and the mainstream paradigm is based on the end-to-end training of neural network architectures, ranging from classical LSTM/RNNs to more recent TCNs and Transformers. 
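The entropic-regularization entry above mentions iterative scaling. The standard Sinkhorn form of that algorithm for an entropy-regularized transport problem looks like the following generic sketch (a textbook illustration, not the paper's toric-geometry machinery):

```python
import numpy as np

def sinkhorn(C, mu, nu, eps=0.1, iters=200):
    """Sinkhorn iterative scaling for entropic optimal transport.

    C: (m, n) cost matrix; mu, nu: marginal distributions summing to 1.
    Returns a coupling P with row sums mu and column sums nu.
    """
    K = np.exp(-C / eps)                 # Gibbs kernel
    u, v = np.ones_like(mu), np.ones_like(nu)
    for _ in range(iters):
        u = mu / (K @ v)                 # rescale to match row marginals
        v = nu / (K.T @ u)               # rescale to match column marginals
    return u[:, None] * K * v[None, :]

C = np.array([[0.0, 1.0], [1.0, 0.0]])
P = sinkhorn(C, np.array([0.5, 0.5]), np.array([0.5, 0.5]))
print(P.round(3))                        # mass concentrates on the cheap cells
```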
Motivated by the recent success of representation learning in computer vision and natural language processing, we argue that a more promising paradigm for time series forecasting is to first learn disentangled feature representations, followed by a simple regression fine-tuning step -- we justify such a paradigm from a causal perspective. Following this principle, we propose a new time series representation learning framework for time series forecasting named CoST, which applies contrastive learning methods to learn disentangled seasonal-trend representations. CoST comprises both time domain and frequency domain contrastive losses to learn discriminative trend and seasonal representations, respectively. Extensive experiments on real-world datasets show that CoST consistently outperforms the state-of-the-art methods by a considerable margin, achieving a 21.3\% improvement in MSE on multivariate benchmarks. It is also robust to various choices of backbone encoders, as well as downstream regressors.", "sentences": ["CoST: Contrastive Learning of Disentangled Seasonal-Trend Representations for Time Series Forecasting.", "Deep learning has been actively studied for time series forecasting, and the mainstream paradigm is based on the end-to-end training of neural network architectures, ranging from classical LSTM/RNNs to more recent TCNs and Transformers.", "Motivated by the recent success of representation learning in computer vision and natural language processing, we argue that a more promising paradigm for time series forecasting is to first learn disentangled feature representations, followed by a simple regression fine-tuning step -- we justify such a paradigm from a causal perspective.", "Following this principle, we propose a new time series representation learning framework for time series forecasting named CoST, which applies contrastive learning methods to learn disentangled seasonal-trend representations.", "CoST comprises both time domain and frequency domain contrastive losses to learn discriminative trend and seasonal representations, respectively.", "Extensive experiments on real-world datasets show that CoST consistently outperforms the state-of-the-art methods by a considerable margin, achieving a 21.3\% improvement in MSE on multivariate benchmarks.", "It is also robust to various choices of backbone encoders, as well as downstream regressors."]} {"id": "http://arxiv.org/abs/2202.01602", "title": "The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective.", "authors": "Satyapriya Krishna, Tessa Han, Alex Gu, Javin Pombra, Shahin Jabbari, Steven Wu, Himabindu Lakkaraju", "abstract": "As various post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes settings, it becomes critical to develop a deeper understanding of whether and when the explanations output by these methods disagree with each other, and how such disagreements are resolved in practice. However, there is little to no research that provides answers to these critical questions. In this work, we introduce and study the disagreement problem in explainable machine learning. More specifically, we formalize the notion of disagreement between explanations, analyze how often such disagreements occur in practice, and how practitioners resolve these disagreements.
To this end, we first conduct interviews with data scientists to understand what constitutes disagreement between explanations generated by different methods for the same model prediction, and introduce a novel quantitative framework to formalize this understanding. We then leverage this framework to carry out a rigorous empirical analysis with four real-world datasets, six state-of-the-art post hoc explanation methods, and eight different predictive models, to measure the extent of disagreement between the explanations generated by various popular explanation methods. In addition, we carry out an online user study with data scientists to understand how they resolve the aforementioned disagreements. Our results indicate that state-of-the-art explanation methods often disagree in terms of the explanations they output. Our findings underscore the importance of developing principled evaluation metrics that enable practitioners to effectively compare explanations.", "sentences": ["The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective.", "As various post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes settings, it becomes critical to develop a deeper understanding of whether and when the explanations output by these methods disagree with each other, and how such disagreements are resolved in practice.", "However, there is little to no research that provides answers to these critical questions.", "In this work, we introduce and study the disagreement problem in explainable machine learning.", "More specifically, we formalize the notion of disagreement between explanations, analyze how often such disagreements occur in practice, and how practitioners resolve these disagreements.", "To this end, we first conduct interviews with data scientists to understand what constitutes disagreement between explanations generated by different methods for the same model prediction, and introduce a novel quantitative framework to formalize this understanding.", "We then leverage this framework to carry out a rigorous empirical analysis with four real-world datasets, six state-of-the-art post hoc explanation methods, and eight different predictive models, to measure the extent of disagreement between the explanations generated by various popular explanation methods.", "In addition, we carry out an online user study with data scientists to understand how they resolve the aforementioned disagreements.", "Our results indicate that state-of-the-art explanation methods often disagree in terms of the explanations they output.", "Our findings underscore the importance of developing principled evaluation metrics that enable practitioners to effectively compare explanations."]} {"id": "http://arxiv.org/abs/2202.01606", "title": "Graph Coloring with Physics-Inspired Graph Neural Networks.", "authors": "Martin J. A. Schuetz, J. Kyle Brubaker, Zhihuai Zhu, Helmut G. Katzgraber", "abstract": "We show how graph neural networks can be used to solve the canonical graph coloring problem. We frame graph coloring as a multi-class node classification problem and utilize an unsupervised training strategy based on the statistical physics Potts model. Generalizations to other multi-class problems such as community detection, data clustering, and the minimum clique cover problem are straightforward.
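The Potts-model training strategy in the graph-coloring entry above reduces to a simple differentiable loss: softmax the per-node logits into color probabilities, then penalize each edge by the probability that its endpoints share a color. A sketch of the loss alone, on toy logits (the paper produces the logits with a GNN):

```python
import numpy as np

def potts_coloring_loss(logits, edges):
    """Unsupervised Potts-style coloring loss: each edge (i, j) pays
    p_i . p_j, the probability that both endpoints pick the same color."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)       # soft color assignments
    return sum(float(p[i] @ p[j]) for i, j in edges)

logits = np.array([[5.0, 0.0], [0.0, 5.0], [5.0, 0.0]])  # 3 nodes, 2 colors
edges = [(0, 1), (1, 2)]                                  # a path graph
print(potts_coloring_loss(logits, edges))   # small value: near-valid coloring
```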
We provide numerical benchmark results and illustrate our approach with an end-to-end application for a real-world scheduling use case within a comprehensive encode-process-decode framework. Our optimization approach performs on par with or outperforms existing solvers, with the ability to scale to problems with millions of variables.", "sentences": ["Graph Coloring with Physics-Inspired Graph Neural Networks.", "We show how graph neural networks can be used to solve the canonical graph coloring problem.", "We frame graph coloring as a multi-class node classification problem and utilize an unsupervised training strategy based on the statistical physics Potts model.", "Generalizations to other multi-class problems such as community detection, data clustering, and the minimum clique cover problem are straightforward.", "We provide numerical benchmark results and illustrate our approach with an end-to-end application for a real-world scheduling use case within a comprehensive encode-process-decode framework.", "Our optimization approach performs on par with or outperforms existing solvers, with the ability to scale to problems with millions of variables."]} {"id": "http://arxiv.org/abs/2202.01615", "title": "Measuring Disparate Outcomes of Content Recommendation Algorithms with Distributional Inequality Metrics.", "authors": "Tomo Lazovich, Luca Belli, Aaron Gonzales, Amanda Bower, Uthaipon Tantipongpipat, Kristian Lum, Ferenc Huszar, Rumman Chowdhury", "abstract": "The harmful impacts of algorithmic decision systems have recently come into focus, with many examples of systems such as machine learning (ML) models amplifying existing societal biases. Most metrics attempting to quantify disparities resulting from ML algorithms focus on differences between groups, dividing users based on demographic identities and comparing model performance or overall outcomes between these groups. However, in industry settings, such information is often not available, and inferring these characteristics carries its own risks and biases. Moreover, typical metrics that focus on a single classifier's output ignore the complex network of systems that produce outcomes in real-world settings. In this paper, we evaluate a set of metrics originating from economics, distributional inequality metrics, and their ability to measure disparities in content exposure in a production recommendation system, the Twitter algorithmic timeline. We define desirable criteria for metrics to be used in an operational setting, specifically by ML practitioners. We characterize different types of engagement with content on Twitter using these metrics, and use these results to evaluate the metrics with respect to the desired criteria. We show that we can use these metrics to identify content suggestion algorithms that contribute more strongly to skewed outcomes between users.
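Distributional inequality metrics of the kind evaluated in the entry above are computed directly from an exposure vector, with no demographic labels required; the Gini coefficient is the canonical example. A standard formula, shown on hypothetical exposure counts:

```python
import numpy as np

def gini(x):
    """Gini coefficient of a nonnegative vector (0 = perfect equality,
    values near 1 = exposure concentrated on a few users)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return float((2 * ranks - n - 1) @ x / (n * x.sum()))

print(gini([1, 1, 1, 1]))      # 0.0: exposure spread evenly
print(gini([0, 0, 0, 100]))    # 0.75: exposure concentrated on one user
```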
Overall, we conclude that these metrics can be useful tools for understanding disparate outcomes in online social networks.", "sentences": ["Measuring Disparate Outcomes of Content Recommendation Algorithms with Distributional Inequality Metrics.", "The harmful impacts of algorithmic decision systems have recently come into focus, with many examples of systems such as machine learning (ML) models amplifying existing societal biases.", "Most metrics attempting to quantify disparities resulting from ML algorithms focus on differences between groups, dividing users based on demographic identities and comparing model performance or overall outcomes between these groups.", "However, in industry settings, such information is often not available, and inferring these characteristics carries its own risks and biases.", "Moreover, typical metrics that focus on a single classifier's output ignore the complex network of systems that produce outcomes in real-world settings.", "In this paper, we evaluate a set of metrics originating from economics, distributional inequality metrics, and their ability to measure disparities in content exposure in a production recommendation system, the Twitter algorithmic timeline.", "We define desirable criteria for metrics to be used in an operational setting, specifically by ML practitioners.", "We characterize different types of engagement with content on Twitter using these metrics, and use these results to evaluate the metrics with respect to the desired criteria.", "We show that we can use these metrics to identify content suggestion algorithms that contribute more strongly to skewed outcomes between users.", "Overall, we conclude that these metrics can be useful tools for understanding disparate outcomes in online social networks."]} {"id": "http://arxiv.org/abs/2202.01619", "title": "On Manifold Hypothesis: Hypersurface Submanifold Embedding Using Osculating Hyperspheres.", "authors": "Benyamin Ghojogh, Fakhri Karray, Mark Crowley", "abstract": "Consider a set of $n$ data points in the Euclidean space $\\mathbb{R}^d$. This set is called a dataset in machine learning and data science. The manifold hypothesis states that the dataset lies on a low-dimensional submanifold with high probability. All dimensionality reduction and manifold learning methods rely on the manifold hypothesis. In this paper, we show that the dataset lies on an embedded hypersurface submanifold which is locally $(d-1)$-dimensional. Hence, we show that the manifold hypothesis holds at least for the embedding dimensionality $d-1$. Using an induction in a pyramid structure, we also extend the embedding dimensionality to lower embedding dimensionalities to show the validity of the manifold hypothesis for embedding dimensionalities $\\{1, 2, \\dots, d-1\\}$. For embedding the hypersurface, we first construct the $d$ nearest neighbors graph for the data. For every point, we fit an osculating hypersphere $S^{d-1}$ using its neighbors where this hypersphere is osculating to a hypothetical hypersurface. Then, using surgery theory, we apply surgery on the osculating hyperspheres to obtain $n$ hyper-caps. We connect the hyper-caps to one another using partial hyper-cylinders. By connecting all parts, the embedded hypersurface is obtained as the disjoint union of these elements. We discuss the geometrical characteristics of the embedded hypersurface, such as having boundary, its topology, smoothness, boundedness, orientability, compactness, and injectivity.
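Fitting a hypersphere to a point's neighbors, as in the manifold-hypothesis entry above, can be illustrated with the classic algebraic least-squares fit: $\|x - c\|^2 = r^2$ rearranges to $2c \cdot x + (r^2 - \|c\|^2) = \|x\|^2$, which is linear in the unknowns. A sketch on points sampled from a circle (illustrative; the paper's construction goes on to apply surgery to the fitted spheres):

```python
import numpy as np

def fit_hypersphere(X):
    """Algebraic least-squares hypersphere fit to points X of shape (m, d):
    solve 2 c.x + k = ||x||^2 for (c, k), where k = r^2 - ||c||^2."""
    A = np.hstack([2 * X, np.ones((len(X), 1))])
    b = (X ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    c, k = sol[:-1], sol[-1]
    r = np.sqrt(k + c @ c)
    return c, r

theta = np.linspace(0, np.pi, 8)
X = np.c_[np.cos(theta), np.sin(theta)]   # neighborhood on the unit circle
c, r = fit_hypersphere(X)
print(c.round(6), round(r, 6))            # ~ [0, 0] and 1.0
```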
Some discussion is also provided on the linearity and structure of the data. This paper lies at the intersection of several fields of science, including machine learning, differential geometry, and algebraic topology.", "sentences": ["On Manifold Hypothesis: Hypersurface Submanifold Embedding Using Osculating Hyperspheres.", "Consider a set of $n$ data points in the Euclidean space $\\mathbb{R}^d$.", "This set is called a dataset in machine learning and data science.", "The manifold hypothesis states that the dataset lies on a low-dimensional submanifold with high probability.", "All dimensionality reduction and manifold learning methods rely on the manifold hypothesis.", "In this paper, we show that the dataset lies on an embedded hypersurface submanifold which is locally $(d-1)$-dimensional.", "Hence, we show that the manifold hypothesis holds at least for the embedding dimensionality $d-1$.", "Using an induction in a pyramid structure, we also extend the embedding dimensionality to lower embedding dimensionalities to show the validity of the manifold hypothesis for embedding dimensionalities $\\{1, 2, \\dots, d-1\\}$.", "For embedding the hypersurface, we first construct the $d$ nearest neighbors graph for the data.", "For every point, we fit an osculating hypersphere $S^{d-1}$ using its neighbors where this hypersphere is osculating to a hypothetical hypersurface.", "Then, using surgery theory, we apply surgery on the osculating hyperspheres to obtain $n$ hyper-caps.", "We connect the hyper-caps to one another using partial hyper-cylinders.", "By connecting all parts, the embedded hypersurface is obtained as the disjoint union of these elements.", "We discuss the geometrical characteristics of the embedded hypersurface, such as having boundary, its topology, smoothness, boundedness, orientability, compactness, and injectivity.", "Some discussion is also provided on the linearity and structure of the data.", "This paper lies at the intersection of several fields of science, including machine learning, differential geometry, and algebraic topology."]} {"id": "http://arxiv.org/abs/2202.01627", "title": "Non-Vacuous Generalisation Bounds for Shallow Neural Networks.", "authors": "Felix Biggs, Benjamin Guedj", "abstract": "We focus on a specific class of shallow neural networks with a single hidden layer, namely those with $L_2$-normalised data and either a sigmoid-shaped Gaussian error function (\"erf\") activation or a Gaussian Error Linear Unit (GELU) activation. For these networks, we derive new generalisation bounds through the PAC-Bayesian theory; unlike most existing such bounds they apply to neural networks with deterministic rather than randomised parameters.
Our bounds are empirically non-vacuous when the network is trained with vanilla stochastic gradient descent on MNIST and Fashion-MNIST.", "sentences": ["Non-Vacuous Generalisation Bounds for Shallow Neural Networks.", "We focus on a specific class of shallow neural networks with a single hidden layer, namely those with $L_2$-normalised data and either a sigmoid-shaped Gaussian error function (\"erf\") activation or a Gaussian Error Linear Unit (GELU) activation.", "For these networks, we derive new generalisation bounds through the PAC-Bayesian theory; unlike most existing such bounds they apply to neural networks with deterministic rather than randomised parameters.", "Our bounds are empirically non-vacuous when the network is trained with vanilla stochastic gradient descent on MNIST and Fashion-MNIST."]} {"id": "http://arxiv.org/abs/2202.01653", "title": "Learning strides in convolutional neural networks.", "authors": "Rachid Riad, Olivier Teboul, David Grangier, Neil Zeghidour", "abstract": "Convolutional neural networks typically contain several downsampling operators, such as strided convolutions or pooling layers, that progressively reduce the resolution of intermediate representations. This provides some shift-invariance while reducing the computational complexity of the whole architecture. A critical hyperparameter of such layers is their stride: the integer factor of downsampling. As strides are not differentiable, finding the best configuration either requires cross-validation or discrete optimization (e.g. architecture search), which rapidly become prohibitive as the search space grows exponentially with the number of downsampling layers. Hence, exploring this search space by gradient descent would allow finding better configurations at a lower computational cost. This work introduces DiffStride, the first downsampling layer with learnable strides. Our layer learns the size of a cropping mask in the Fourier domain, that effectively performs resizing in a differentiable way. Experiments on audio and image classification show the generality and effectiveness of our solution: we use DiffStride as a drop-in replacement to standard downsampling layers and outperform them. In particular, we show that introducing our layer into a ResNet-18 architecture allows keeping consistent high performance on CIFAR10, CIFAR100 and ImageNet even when training starts from poor random stride configurations. Moreover, formulating strides as learnable variables allows us to introduce a regularization term that controls the computational complexity of the architecture. 
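The Fourier-domain cropping behind the DiffStride entry above can be demonstrated without the learnable part: downsampling an image amounts to keeping a centered low-frequency window of its 2D spectrum. A fixed-window sketch (DiffStride instead learns the window size through a smooth, differentiable mask):

```python
import numpy as np

def fourier_crop(x, out_h, out_w):
    """Downsample a 2D array by cropping a centered window of its
    (shifted) Fourier spectrum, then inverting the transform."""
    F = np.fft.fftshift(np.fft.fft2(x))
    h, w = x.shape
    top, left = (h - out_h) // 2, (w - out_w) // 2
    Fc = F[top:top + out_h, left:left + out_w]
    scale = (out_h * out_w) / (h * w)        # keep amplitudes comparable
    return np.real(np.fft.ifft2(np.fft.ifftshift(Fc))) * scale

img = np.add.outer(np.sin(np.linspace(0, 3, 32)), np.cos(np.linspace(0, 3, 32)))
small = fourier_crop(img, 16, 16)            # effective stride of 2
print(img.shape, "->", small.shape)
```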
We show how this regularization allows trading off accuracy for efficiency on ImageNet.", "sentences": ["Learning strides in convolutional neural networks.", "Convolutional neural networks typically contain several downsampling operators, such as strided convolutions or pooling layers, that progressively reduce the resolution of intermediate representations.", "This provides some shift-invariance while reducing the computational complexity of the whole architecture.", "A critical hyperparameter of such layers is their stride: the integer factor of downsampling.", "As strides are not differentiable, finding the best configuration either requires cross-validation or discrete optimization (e.g.", "architecture search), which rapidly become prohibitive as the search space grows exponentially with the number of downsampling layers.", "Hence, exploring this search space by gradient descent would allow finding better configurations at a lower computational cost.", "This work introduces DiffStride, the first downsampling layer with learnable strides.", "Our layer learns the size of a cropping mask in the Fourier domain, that effectively performs resizing in a differentiable way.", "Experiments on audio and image classification show the generality and effectiveness of our solution: we use DiffStride as a drop-in replacement to standard downsampling layers and outperform them.", "In particular, we show that introducing our layer into a ResNet-18 architecture allows keeping consistent high performance on CIFAR10, CIFAR100 and ImageNet even when training starts from poor random stride configurations.", "Moreover, formulating strides as learnable variables allows us to introduce a regularization term that controls the computational complexity of the architecture.", "We show how this regularization allows trading off accuracy for efficiency on ImageNet."]} {"id": "http://arxiv.org/abs/2202.01661", "title": "Selection in the Presence of Implicit Bias: The Advantage of Intersectional Constraints.", "authors": "Anay Mehrotra, Bary S. R. Pradelski, Nisheeth K. Vishnoi", "abstract": "In selection processes such as hiring, promotion, and college admissions, implicit bias toward socially-salient attributes such as race, gender, or sexual orientation of candidates is known to produce persistent inequality and reduce aggregate utility for the decision maker. Interventions such as the Rooney Rule and its generalizations, which require the decision maker to select at least a specified number of individuals from each affected group, have been proposed to mitigate the adverse effects of implicit bias in selection. Recent works have established that such lower-bound constraints can be very effective in improving aggregate utility in the case when each individual belongs to at most one affected group. However, in several settings, individuals may belong to multiple affected groups and, consequently, face more extreme implicit bias due to this intersectionality. We consider independently drawn utilities and show that, in the intersectional case, the aforementioned non-intersectional constraints can only recover part of the total utility achievable in the absence of implicit bias. On the other hand, we show that if one includes appropriate lower-bound constraints on the intersections, almost all the utility achievable in the absence of implicit bias can be recovered. 
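The lower-bound constraints discussed in the selection entry above are straightforward to operationalize. A greedy illustrative version (not the paper's analysis): first fill each group's quota with that group's best candidates, then complete the slate by utility; for intersectional constraints the group ids would index group combinations.

```python
import numpy as np

def constrained_select(utils, groups, k, lower_bounds):
    """Select k candidates by utility, subject to per-group minimum counts.

    lower_bounds: dict mapping group id -> minimum number selected.
    """
    order = np.argsort(-np.asarray(utils))       # candidates by utility
    chosen = []
    for g, m in lower_bounds.items():            # satisfy each quota first
        chosen.extend([i for i in order if groups[i] == g][:m])
    for i in order:                              # then fill remaining slots
        if len(chosen) >= k:
            break
        if i not in chosen:
            chosen.append(i)
    return chosen[:k]

utils = [0.9, 0.8, 0.5, 0.4, 0.3]
groups = ["a", "a", "b", "b", "a"]
print(constrained_select(utils, groups, k=3, lower_bounds={"b": 1}))
```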
Thus, intersectional constraints can offer a significant advantage over a reductionist dimension-by-dimension non-intersectional approach to reducing inequality.", "sentences": ["Selection in the Presence of Implicit Bias: The Advantage of Intersectional Constraints.", "In selection processes such as hiring, promotion, and college admissions, implicit bias toward socially-salient attributes such as race, gender, or sexual orientation of candidates is known to produce persistent inequality and reduce aggregate utility for the decision maker.", "Interventions such as the Rooney Rule and its generalizations, which require the decision maker to select at least a specified number of individuals from each affected group, have been proposed to mitigate the adverse effects of implicit bias in selection.", "Recent works have established that such lower-bound constraints can be very effective in improving aggregate utility in the case when each individual belongs to at most one affected group.", "However, in several settings, individuals may belong to multiple affected groups and, consequently, face more extreme implicit bias due to this intersectionality.", "We consider independently drawn utilities and show that, in the intersectional case, the aforementioned non-intersectional constraints can only recover part of the total utility achievable in the absence of implicit bias.", "On the other hand, we show that if one includes appropriate lower-bound constraints on the intersections, almost all the utility achievable in the absence of implicit bias can be recovered.", "Thus, intersectional constraints can offer a significant advantage over a reductionist dimension-by-dimension non-intersectional approach to reducing inequality."]} {"id": "http://arxiv.org/abs/2202.01664", "title": "Removing Distortion Effects in Music Using Deep Neural Networks.", "authors": "Johannes Imort, Giorgio Fabbro, Marco A. Mart\u00ednez Ram\u00edrez, Stefan Uhlich, Yuichiro Koyama, Yuki Mitsufuji", "abstract": "Audio effects are an essential element in the context of music production, and therefore, modeling analog audio effects has been extensively researched for decades using system-identification methods, circuit simulation, and recently, deep learning. However, only few works tackled the reconstruction of signals that were processed using an audio effect unit. Given the recent advances in music source separation and automatic mixing, the removal of audio effects could facilitate an automatic remixing system. This paper focuses on removing distortion and clipping applied to guitar tracks for music production while presenting a comparative investigation of different deep neural network (DNN) architectures on this task. We achieve exceptionally good results in distortion removal using DNNs for effects that superimpose the clean signal to the distorted signal, while the task is more challenging if the clean signal is not superimposed. 
Nevertheless, in the latter case, the neural models under evaluation surpass one state-of-the-art declipping system in terms of source-to-distortion ratio, leading to better quality and faster inference.", "sentences": ["Removing Distortion Effects in Music Using Deep Neural Networks.", "Audio effects are an essential element in the context of music production, and therefore, modeling analog audio effects has been extensively researched for decades using system-identification methods, circuit simulation, and recently, deep learning.", "However, only few works tackled the reconstruction of signals that were processed using an audio effect unit.", "Given the recent advances in music source separation and automatic mixing, the removal of audio effects could facilitate an automatic remixing system.", "This paper focuses on removing distortion and clipping applied to guitar tracks for music production while presenting a comparative investigation of different deep neural network (DNN) architectures on this task.", "We achieve exceptionally good results in distortion removal using DNNs for effects that superimpose the clean signal to the distorted signal, while the task is more challenging if the clean signal is not superimposed.", "Nevertheless, in the latter case, the neural models under evaluation surpass one state-of-the-art declipping system in terms of source-to-distortion ratio, leading to better quality and faster inference."]} {"id": "http://arxiv.org/abs/2202.01665", "title": "On Monte Carlo Tree Search for Weighted Vertex Coloring.", "authors": "Cyril Grelier, Olivier Goudet, Jin-Kao Hao", "abstract": "This work presents the first study of using the popular Monte Carlo Tree Search (MCTS) method combined with dedicated heuristics for solving the Weighted Vertex Coloring Problem. Starting with the basic MCTS algorithm, we gradually introduce a number of algorithmic variants where MCTS is extended by various simulation strategies including greedy and local search heuristics. We conduct experiments on well-known benchmark instances to assess the value of each studied combination. We also provide empirical evidence to shed light on the advantages and limits of each strategy.", "sentences": ["On Monte Carlo Tree Search for Weighted Vertex Coloring.", "This work presents the first study of using the popular Monte Carlo Tree Search (MCTS) method combined with dedicated heuristics for solving the Weighted Vertex Coloring Problem.", "Starting with the basic MCTS algorithm, we gradually introduce a number of algorithmic variants where MCTS is extended by various simulation strategies including greedy and local search heuristics.", "We conduct experiments on well-known benchmark instances to assess the value of each studied combination.", "We also provide empirical evidence to shed light on the advantages and limits of each strategy."]} {"id": "http://arxiv.org/abs/2202.01666", "title": "Equality Is Not Equity: Proportional Fairness in Federated Learning.", "authors": "Guojun Zhang, Saber Malekmohammadi, Xi Chen, Yaoliang Yu", "abstract": "Ensuring fairness of machine learning (ML) algorithms is becoming an increasingly important mission for ML service providers. This is even more critical and challenging in the federated learning (FL) scenario, given a large number of diverse participating clients. Simply mandating equality across clients could lead to many undesirable consequences, potentially discouraging high-performing clients and resulting in sub-optimal overall performance. 
In order to achieve better equity rather than equality, in this work, we introduce and study proportional fairness (PF) in FL, which has a deep connection with game theory. By viewing FL from a cooperative game perspective, where the players (clients) collaboratively learn a good model, we formulate PF as Nash bargaining solutions. Based on this concept, we propose PropFair, a novel and easy-to-implement algorithm for effectively finding PF solutions, and we prove its convergence properties. We illustrate through experiments that PropFair consistently improves the worst-case and the overall performances simultaneously over state-of-the-art fair FL algorithms for a wide array of vision and language datasets, thus achieving better equity.", "sentences": ["Equality Is Not Equity: Proportional Fairness in Federated Learning.", "Ensuring fairness of machine learning (ML) algorithms is becoming an increasingly important mission for ML service providers.", "This is even more critical and challenging in the federated learning (FL) scenario, given a large number of diverse participating clients.", "Simply mandating equality across clients could lead to many undesirable consequences, potentially discouraging high-performing clients and resulting in sub-optimal overall performance.", "In order to achieve better equity rather than equality, in this work, we introduce and study proportional fairness (PF) in FL, which has a deep connection with game theory.", "By viewing FL from a cooperative game perspective, where the players (clients) collaboratively learn a good model, we formulate PF as Nash bargaining solutions.", "Based on this concept, we propose PropFair, a novel and easy-to-implement algorithm for effectively finding PF solutions, and we prove its convergence properties.", "We illustrate through experiments that PropFair consistently improves the worst-case and the overall performances simultaneously over state-of-the-art fair FL algorithms for a wide array of vision and language datasets, thus achieving better equity."]} {"id": "http://arxiv.org/abs/2202.01670", "title": "Ranking with Confidence for Large Scale Comparison Data.", "authors": "Filipa Valdeira, Cl\u00e1udia Soares", "abstract": "In this work, we leverage a generative data model considering comparison noise to develop a fast, precise, and informative ranking algorithm from pairwise comparisons that produces a measure of confidence on each comparison. The problem of ranking a large number of items from noisy and sparse pairwise comparison data arises in diverse applications, like ranking players in online games, document retrieval or ranking human perceptions. Although different algorithms are available, we need fast, large-scale algorithms whose accuracy degrades gracefully when the number of comparisons is too small. Fitting our proposed model entails solving a non-convex optimization problem, which we tightly approximate by a sum of quasi-convex functions and a regularization term. Resorting to an iterative reweighted minimization and the Primal-Dual Hybrid Gradient method, we obtain PD-Rank, achieving a Kendall tau 0.1 higher than all comparing methods, even for 10\\% of wrong comparisons in simulated data matching our data model, and leading in accuracy if data is generated according to the Bradley-Terry model, in both cases faster by one order of magnitude, in seconds. 
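The Nash-bargaining view in the PropFair entry above amounts to maximizing a sum of log-utilities across clients. A sketch of such a surrogate objective, assuming a baseline constant M (the name and value are placeholders; the actual algorithm optimizes an objective of this shape inside a standard FL training loop):

```python
import numpy as np

def propfair_objective(losses, M=2.0):
    """Proportional-fairness surrogate: sum_i log(M - loss_i), where
    M - loss_i plays the role of client i's utility. M is an assumed
    baseline constant that must exceed every client loss."""
    u = M - np.asarray(losses)
    assert (u > 0).all(), "baseline M must exceed every client loss"
    return float(np.log(u).sum())

# Same mean loss, but the balanced profile scores higher -- the
# hallmark of a Nash-bargaining-style objective:
print(propfair_objective([0.5, 0.5]))   # ~0.811
print(propfair_objective([0.1, 0.9]))   # ~0.737
```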
In real data, PD-Rank requires less computational time to achieve the same Kendall tau than active learning methods.", "sentences": ["Ranking with Confidence for Large Scale Comparison Data.", "In this work, we leverage a generative data model considering comparison noise to develop a fast, precise, and informative ranking algorithm from pairwise comparisons that produces a measure of confidence on each comparison.", "The problem of ranking a large number of items from noisy and sparse pairwise comparison data arises in diverse applications, like ranking players in online games, document retrieval or ranking human perceptions.", "Although different algorithms are available, we need fast, large-scale algorithms whose accuracy degrades gracefully when the number of comparisons is too small.", "Fitting our proposed model entails solving a non-convex optimization problem, which we tightly approximate by a sum of quasi-convex functions and a regularization term.", "Resorting to an iterative reweighted minimization and the Primal-Dual Hybrid Gradient method, we obtain PD-Rank, achieving a Kendall tau 0.1 higher than all comparing methods, even for 10\\% of wrong comparisons in simulated data matching our data model, and leading in accuracy if data is generated according to the Bradley-Terry model, in both cases faster by one order of magnitude, in seconds.", "In real data, PD-Rank requires less computational time to achieve the same Kendall tau than active learning methods."]} {"id": "http://arxiv.org/abs/2202.01671", "title": "Log-Euclidean Signatures for Intrinsic Distances Between Unaligned Datasets.", "authors": "Tal Shnitzer, Mikhail Yurochkin, Kristjan Greenewald, Justin Solomon", "abstract": "The need for efficiently comparing and representing datasets with unknown alignment spans various fields, from model analysis and comparison in machine learning to trend discovery in collections of medical datasets. We use manifold learning to compare the intrinsic geometric structures of different datasets by comparing their diffusion operators, symmetric positive-definite (SPD) matrices that relate to approximations of the continuous Laplace-Beltrami operator from discrete samples. Existing methods typically compare such operators in a pointwise manner or assume known data alignment. Instead, we exploit the Riemannian geometry of SPD matrices to compare these operators and define a new theoretically-motivated distance based on a lower bound of the log-Euclidean metric. Our framework facilitates comparison of data manifolds expressed in datasets with different sizes, numbers of features, and measurement modalities. 
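For context on the ranking entry above: the Bradley-Terry model it references can be fit by plain gradient ascent on the pairwise log-likelihood. A minimal sketch (PD-Rank itself assumes a different comparison-noise model and solver):

```python
import numpy as np

def bradley_terry(wins, n_items, lr=0.5, iters=500):
    """Fit Bradley-Terry scores s by gradient ascent; the model says
    P(i beats j) = sigmoid(s_i - s_j).

    wins: list of (i, j) pairs meaning "i beat j".
    """
    s = np.zeros(n_items)
    for _ in range(iters):
        g = np.zeros(n_items)
        for i, j in wins:
            p = 1.0 / (1.0 + np.exp(s[j] - s[i]))  # model's P(i beats j)
            g[i] += 1.0 - p                         # d log-lik / d s_i
            g[j] -= 1.0 - p
        s += lr * g / len(wins)
        s -= s.mean()                               # fix the translation gauge
    return s

wins = [(0, 1), (0, 1), (1, 0), (1, 2), (2, 1), (0, 2)]
print(bradley_terry(wins, n_items=3).round(2))      # item 0 ranked highest
```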
Our log-Euclidean signature (LES) distance recovers meaningful structural differences, outperforming competing methods in various application domains.", "sentences": ["Log-Euclidean Signatures for Intrinsic Distances Between Unaligned Datasets.", "The need for efficiently comparing and representing datasets with unknown alignment spans various fields, from model analysis and comparison in machine learning to trend discovery in collections of medical datasets.", "We use manifold learning to compare the intrinsic geometric structures of different datasets by comparing their diffusion operators, symmetric positive-definite (SPD) matrices that relate to approximations of the continuous Laplace-Beltrami operator from discrete samples.", "Existing methods typically compare such operators in a pointwise manner or assume known data alignment.", "Instead, we exploit the Riemannian geometry of SPD matrices to compare these operators and define a new theoretically-motivated distance based on a lower bound of the log-Euclidean metric.", "Our framework facilitates comparison of data manifolds expressed in datasets with different sizes, numbers of features, and measurement modalities.", "Our log-Euclidean signature (LES) distance recovers meaningful structural differences, outperforming competing methods in various application domains."]} {"id": "http://arxiv.org/abs/2202.01672", "title": "SubOmiEmbed: Self-supervised Representation Learning of Multi-omics Data for Cancer Type Classification.", "authors": "Sayed Hashim, Muhammad Ali, Karthik Nandakumar, Mohammad Yaqub", "abstract": "For personalized medicines, very crucial intrinsic information is present in high dimensional omics data which is difficult to capture due to the large number of molecular features and small number of available samples. Different types of omics data show various aspects of samples. Integration and analysis of multi-omics data give us a broad view of tumours, which can improve clinical decision making. Omics data, mainly DNA methylation and gene expression profiles are usually high dimensional data with a lot of molecular features. In recent years, variational autoencoders (VAE) have been extensively used in embedding image and text data into lower dimensional latent spaces. In our project, we extend the idea of using a VAE model for low dimensional latent space extraction with the self-supervised learning technique of feature subsetting. With VAEs, the key idea is to make the model learn meaningful representations from different types of omics data, which could then be used for downstream tasks such as cancer type classification. The main goals are to overcome the curse of dimensionality and integrate methylation and expression data to combine information about different aspects of same tissue samples, and hopefully extract biologically relevant features. Our extension involves training encoder and decoder to reconstruct the data from just a subset of it. By doing this, we force the model to encode most important information in the latent representation. We also added an identity to the subsets so that the model knows which subset is being fed into it during training and testing. We experimented with our approach and found that SubOmiEmbed produces comparable results to the baseline OmiEmbed with a much smaller network and by using just a subset of the data. 
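The log-Euclidean metric underlying the LES entry above compares SPD matrices through their matrix logarithms. A minimal sketch using the eigendecomposition of symmetric matrices:

```python
import numpy as np

def spd_log(A):
    """Matrix logarithm of a symmetric positive-definite matrix via
    eigendecomposition: V diag(log w) V^T."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def log_euclidean_distance(A, B):
    """Log-Euclidean distance between SPD matrices: ||log A - log B||_F."""
    return float(np.linalg.norm(spd_log(A) - spd_log(B), "fro"))

A = np.array([[2.0, 0.0], [0.0, 1.0]])
B = np.array([[1.0, 0.3], [0.3, 1.0]])
print(round(log_euclidean_distance(A, B), 4))
```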
This work can be improved to integrate mutation-based genomic data as well.", "sentences": ["SubOmiEmbed: Self-supervised Representation Learning of Multi-omics Data for Cancer Type Classification.", "For personalized medicines, very crucial intrinsic information is present in high dimensional omics data which is difficult to capture due to the large number of molecular features and small number of available samples.", "Different types of omics data show various aspects of samples.", "Integration and analysis of multi-omics data give us a broad view of tumours, which can improve clinical decision making.", "Omics data, mainly DNA methylation and gene expression profiles are usually high dimensional data with a lot of molecular features.", "In recent years, variational autoencoders (VAE) have been extensively used in embedding image and text data into lower dimensional latent spaces.", "In our project, we extend the idea of using a VAE model for low dimensional latent space extraction with the self-supervised learning technique of feature subsetting.", "With VAEs, the key idea is to make the model learn meaningful representations from different types of omics data, which could then be used for downstream tasks such as cancer type classification.", "The main goals are to overcome the curse of dimensionality and integrate methylation and expression data to combine information about different aspects of same tissue samples, and hopefully extract biologically relevant features.", "Our extension involves training encoder and decoder to reconstruct the data from just a subset of it.", "By doing this, we force the model to encode most important information in the latent representation.", "We also added an identity to the subsets so that the model knows which subset is being fed into it during training and testing.", "We experimented with our approach and found that SubOmiEmbed produces comparable results to the baseline OmiEmbed with a much smaller network and by using just a subset of the data.", "This work can be improved to integrate mutation-based genomic data as well."]} {"id": "http://arxiv.org/abs/2202.01677", "title": "Separating Rule Discovery and Global Solution Composition in a Learning Classifier System.", "authors": "Michael Heider, Helena Stegherr, Jonathan Wurth, Roman Sraj, J\u00f6rg H\u00e4hner", "abstract": "The utilization of digital agents to support crucial decision making is increasing in many industrial scenarios. However, trust in suggestions made by these agents is hard to achieve, though essential for profiting from their application, resulting in a need for explanations for both the decision making process as well as the model itself. For many systems, such as common deep learning black-box models, achieving at least some explainability requires complex post-processing, while other systems profit from being, to a reasonable extent, inherently interpretable. In this paper we propose an easily interpretable rule-based learning system specifically designed and thus especially suited for these scenarios and compare it on a set of regression problems against XCSF, a prominent rule-based learning system with a long research history. One key advantage of our system is that the rules' conditions and which rules compose a solution to the problem are evolved separately. We utilise independent rule fitnesses which allows users to specifically tailor their model structure to fit the given requirements for explainability. 
We find that the results of SupRB2's evaluation are comparable to XCSF's while allowing easier control of model structure and showing a substantially smaller sensitivity to random seeds and data splits. This increased control aids in subsequently providing explanations for both the training and the final structure of the model.", "sentences": ["Separating Rule Discovery and Global Solution Composition in a Learning Classifier System.", "The utilization of digital agents to support crucial decision making is increasing in many industrial scenarios.", "However, trust in suggestions made by these agents is hard to achieve, though essential for profiting from their application, resulting in a need for explanations for both the decision making process as well as the model itself.", "For many systems, such as common deep learning black-box models, achieving at least some explainability requires complex post-processing, while other systems profit from being, to a reasonable extent, inherently interpretable.", "In this paper we propose an easily interpretable rule-based learning system specifically designed and thus especially suited for these scenarios and compare it on a set of regression problems against XCSF, a prominent rule-based learning system with a long research history.", "One key advantage of our system is that the rules' conditions and which rules compose a solution to the problem are evolved separately.", "We utilise independent rule fitnesses which allows users to specifically tailor their model structure to fit the given requirements for explainability.", "We find that the results of SupRB2's evaluation are comparable to XCSF's while allowing easier control of model structure and showing a substantially smaller sensitivity to random seeds and data splits.", "This increased control aids in subsequently providing explanations for both the training and the final structure of the model."]} {"id": "http://arxiv.org/abs/2202.01679", "title": "Certifying Out-of-Domain Generalization for Blackbox Functions.", "authors": "Maurice Weber, Linyi Li, Boxin Wang, Zhikuan Zhao, Bo Li, Ce Zhang", "abstract": "Certifying the robustness of model performance under bounded data distribution shifts has recently attracted intensive interests under the umbrella of distributional robustness. However, existing techniques either make strong assumptions on the model class and loss functions that can be certified, such as smoothness expressed via Lipschitz continuity of gradients, or require to solve complex optimization problems. As a result, the wider application of these techniques is currently limited by its scalability and flexibility -- these techniques often do not scale to large-scale datasets with modern deep neural networks or cannot handle loss functions which may be non-smooth, such as the 0-1 loss. In this paper, we focus on the problem of certifying distributional robustness for black box models and bounded losses, without other assumptions. We propose a novel certification framework given bounded distance of mean and variance of two distributions. Our certification technique scales to ImageNet-scale datasets, complex models, and a diverse range of loss functions. We then focus on one specific application enabled by such scalability and flexibility, i.e., certifying out-of-domain generalization for large neural networks and loss functions such as accuracy and AUC. 
We experimentally validate our certification method on a number of datasets, ranging from ImageNet, where we provide the first non-vacuous certified out-of-domain generalization, to smaller classification tasks where we are able to compare with the state-of-the-art and show that our method performs considerably better.", "sentences": ["Certifying Out-of-Domain Generalization for Blackbox Functions.", "Certifying the robustness of model performance under bounded data distribution shifts has recently attracted intensive interest under the umbrella of distributional robustness.", "However, existing techniques either make strong assumptions on the model class and loss functions that can be certified, such as smoothness expressed via Lipschitz continuity of gradients, or require solving complex optimization problems.", "As a result, the wider application of these techniques is currently limited by their scalability and flexibility -- these techniques often do not scale to large-scale datasets with modern deep neural networks or cannot handle loss functions which may be non-smooth, such as the 0-1 loss.", "In this paper, we focus on the problem of certifying distributional robustness for black box models and bounded losses, without other assumptions.", "We propose a novel certification framework given bounded distance of mean and variance of two distributions.", "Our certification technique scales to ImageNet-scale datasets, complex models, and a diverse range of loss functions.", "We then focus on one specific application enabled by such scalability and flexibility, i.e., certifying out-of-domain generalization for large neural networks and loss functions such as accuracy and AUC.", "We experimentally validate our certification method on a number of datasets, ranging from ImageNet, where we provide the first non-vacuous certified out-of-domain generalization, to smaller classification tasks where we are able to compare with the state-of-the-art and show that our method performs considerably better."]} {"id": "http://arxiv.org/abs/2202.01682", "title": "How to build a cognitive map: insights from models of the hippocampal formation.", "authors": "James C.R. Whittington, David McCaffary, Jacob J.W. Bakermans, Timothy E.J. Behrens", "abstract": "Learning and interpreting the structure of the environment is an innate feature of biological systems, and is integral to guiding flexible behaviours for evolutionary viability. The concept of a cognitive map has emerged as one of the leading metaphors for these capacities, and unravelling the learning and neural representation of such a map has become a central focus of neuroscience. While experimentalists are providing a detailed picture of the neural substrate of cognitive maps in hippocampus and beyond, theorists have been busy building models to bridge the divide between neurons, computation, and behaviour. These models can account for a variety of known representations and neural phenomena, but often provide a differing understanding of not only the underlying principles of cognitive maps, but also the respective roles of hippocampus and cortex.
In this Perspective, we bring many of these models into a common language, distil their underlying principles of constructing cognitive maps, provide novel (re)interpretations for neural phenomena, suggest how the principles can be extended to account for prefrontal cortex representations and, finally, speculate on the role of cognitive maps in higher cognitive capacities.", "sentences": ["How to build a cognitive map: insights from models of the hippocampal formation.", "Learning and interpreting the structure of the environment is an innate feature of biological systems, and is integral to guiding flexible behaviours for evolutionary viability.", "The concept of a cognitive map has emerged as one of the leading metaphors for these capacities, and unravelling the learning and neural representation of such a map has become a central focus of neuroscience.", "While experimentalists are providing a detailed picture of the neural substrate of cognitive maps in hippocampus and beyond, theorists have been busy building models to bridge the divide between neurons, computation, and behaviour.", "These models can account for a variety of known representations and neural phenomena, but often provide a differing understanding of not only the underlying principles of cognitive maps, but also the respective roles of hippocampus and cortex.", "In this Perspective, we bring many of these models into a common language, distil their underlying principles of constructing cognitive maps, provide novel (re)interpretations for neural phenomena, suggest how the principles can be extended to account for prefrontal cortex representations and, finally, speculate on the role of cognitive maps in higher cognitive capacities."]} {"id": "http://arxiv.org/abs/2202.01694", "title": "Variational Nearest Neighbor Gaussian Processes.", "authors": "Luhuan Wu, Geoff Pleiss, John Cunningham", "abstract": "Variational approximations to Gaussian processes (GPs) typically use a small set of inducing points to form a low-rank approximation to the covariance matrix. In this work, we instead exploit a sparse approximation of the precision matrix. We propose variational nearest neighbor Gaussian process (VNNGP), which introduces a prior that only retains correlations within K nearest-neighboring observations, thereby inducing sparse precision structure. Using the variational framework, VNNGP's objective can be factorized over both observations and inducing points, enabling stochastic optimization with a time complexity of O($K^3$). Hence, we can arbitrarily scale the inducing point size, even to the point of putting inducing points at every observed location. 
We compare VNNGP to other scalable GPs through various experiments, and demonstrate that VNNGP (1) can dramatically outperform low-rank methods, and (2) is less prone to overfitting than other nearest neighbor methods.", "sentences": ["Variational Nearest Neighbor Gaussian Processes.", "Variational approximations to Gaussian processes (GPs) typically use a small set of inducing points to form a low-rank approximation to the covariance matrix.", "In this work, we instead exploit a sparse approximation of the precision matrix.", "We propose variational nearest neighbor Gaussian process (VNNGP), which introduces a prior that only retains correlations within K nearest-neighboring observations, thereby inducing sparse precision structure.", "Using the variational framework, VNNGP's objective can be factorized over both observations and inducing points, enabling stochastic optimization with a time complexity of O($K^3$).", "Hence, we can arbitrarily scale the inducing point size, even to the point of putting inducing points at every observed location.", "We compare VNNGP to other scalable GPs through various experiments, and demonstrate that VNNGP (1) can dramatically outperform low-rank methods, and (2) is less prone to overfitting than other nearest neighbor methods."]} {"id": "http://arxiv.org/abs/2202.01709", "title": "Towards Coherent and Consistent Use of Entities in Narrative Generation.", "authors": "Pinelopi Papalampidi, Kris Cao, Tomas Kocisky", "abstract": "Large pre-trained language models (LMs) have demonstrated impressive capabilities in generating long, fluent text; however, there is little to no analysis on their ability to maintain entity coherence and consistency. In this work, we focus on the end task of narrative generation and systematically analyse the long-range entity coherence and consistency in generated stories. First, we propose a set of automatic metrics for measuring model performance in terms of entity usage. Given these metrics, we quantify the limitations of current LMs. Next, we propose augmenting a pre-trained LM with a dynamic entity memory in an end-to-end manner by using an auxiliary entity-related loss for guiding the reads and writes to the memory. We demonstrate that the dynamic entity memory increases entity coherence according to both automatic and human judgment and helps preserve entity-related information, especially in settings with a limited context window.
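The sparse-precision prior behind VNNGP can be illustrated with a small numpy sketch of a nearest-neighbor (Vecchia-style) factorization: each observation is conditioned only on its K nearest predecessors, which makes the Cholesky-like factor of the precision sparse. The RBF kernel, the ordering, and K below are illustrative assumptions, and the factor is stored densely here purely for clarity.

```python
# Minimal sketch of a nearest-neighbor prior with sparse precision structure.
import numpy as np

def rbf(a, b, ls=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def nn_precision(X, K=5, jitter=1e-6):
    n = X.shape[0]
    B = np.eye(n)              # "regression" factor (sparse when stored sparsely)
    d = np.zeros(n)            # conditional variances
    for i in range(n):
        if i == 0:
            d[0] = rbf(X[:1], X[:1])[0, 0] + jitter
            continue
        prev = X[:i]
        idx = np.argsort(((prev - X[i]) ** 2).sum(-1))[:K]   # K nearest predecessors
        Knn = rbf(prev[idx], prev[idx]) + jitter * np.eye(len(idx))
        kin = rbf(X[i:i+1], prev[idx])[0]
        w = np.linalg.solve(Knn, kin)
        B[i, idx] = -w
        d[i] = rbf(X[i:i+1], X[i:i+1])[0, 0] + jitter - kin @ w
    # Precision of f is B^T diag(1/d) B, sparse with O(nK) nonzeros in B.
    return B.T @ np.diag(1.0 / d) @ B
```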
Finally, we also validate that our automatic metrics are correlated with human ratings and serve as a good indicator of the quality of generated stories.", "sentences": ["Towards Coherent and Consistent Use of Entities in Narrative Generation.", "Large pre-trained language models (LMs) have demonstrated impressive capabilities in generating long, fluent text; however, there is little to no analysis on their ability to maintain entity coherence and consistency.", "In this work, we focus on the end task of narrative generation and systematically analyse the long-range entity coherence and consistency in generated stories.", "First, we propose a set of automatic metrics for measuring model performance in terms of entity usage.", "Given these metrics, we quantify the limitations of current LMs.", "Next, we propose augmenting a pre-trained LM with a dynamic entity memory in an end-to-end manner by using an auxiliary entity-related loss for guiding the reads and writes to the memory.", "We demonstrate that the dynamic entity memory increases entity coherence according to both automatic and human judgment and helps preserve entity-related information, especially in settings with a limited context window.", "Finally, we also validate that our automatic metrics are correlated with human ratings and serve as a good indicator of the quality of generated stories."]} {"id": "http://arxiv.org/abs/2202.01712", "title": "Review of automated time series forecasting pipelines.", "authors": "Stefan Meisenbacher, Marian Turowski, Kaleb Phipps, Martin R\u00e4tz, Dirk M\u00fcller, Veit Hagenmeyer, Ralf Mikut", "abstract": "Time series forecasting is fundamental for various use cases in different domains such as energy systems and economics. Creating a forecasting model for a specific use case requires an iterative and complex design process. The typical design process includes the five sections (1) data pre-processing, (2) feature engineering, (3) hyperparameter optimization, (4) forecasting method selection, and (5) forecast ensembling, which are commonly organized in a pipeline structure. One promising approach to handle the ever-growing demand for time series forecasts is automating this design process. The present paper, thus, analyzes the existing literature on automated time series forecasting pipelines to investigate how to automate the design process of forecasting models. Thereby, we consider both Automated Machine Learning (AutoML) and automated statistical forecasting methods in a single forecasting pipeline. For this purpose, we firstly present and compare the proposed automation methods for each pipeline section. Secondly, we analyze the automation methods regarding their interaction, combination, and coverage of the five pipeline sections. For both, we discuss the literature, identify problems, give recommendations, and suggest future research. This review reveals that the majority of papers only cover two or three of the five pipeline sections.
We conclude that future research has to holistically consider the automation of the forecasting pipeline to enable the large-scale application of time series forecasting.", "sentences": ["Review of automated time series forecasting pipelines.", "Time series forecasting is fundamental for various use cases in different domains such as energy systems and economics.", "Creating a forecasting model for a specific use case requires an iterative and complex design process.", "The typical design process includes the five sections (1) data pre-processing, (2) feature engineering, (3) hyperparameter optimization, (4) forecasting method selection, and (5) forecast ensembling, which are commonly organized in a pipeline structure.", "One promising approach to handle the ever-growing demand for time series forecasts is automating this design process.", "The present paper, thus, analyzes the existing literature on automated time series forecasting pipelines to investigate how to automate the design process of forecasting models.", "Thereby, we consider both Automated Machine Learning (AutoML) and automated statistical forecasting methods in a single forecasting pipeline.", "For this purpose, we firstly present and compare the proposed automation methods for each pipeline section.", "Secondly, we analyze the automation methods regarding their interaction, combination, and coverage of the five pipeline sections.", "For both, we discuss the literature, identify problems, give recommendations, and suggest future research.", "This review reveals that the majority of papers only cover two or three of the five pipeline sections.", "We conclude that future research has to holistically consider the automation of the forecasting pipeline to enable the large-scale application of time series forecasting."]} {"id": "http://arxiv.org/abs/2202.01719", "title": "FORML: Learning to Reweight Data for Fairness.", "authors": "Bobby Yan, Skyler Seto, Nicholas Apostoloff", "abstract": "Deployed machine learning models are evaluated by multiple metrics beyond accuracy, such as fairness and robustness. However, such models are typically trained to minimize the average loss for a single metric, which is often a proxy for accuracy. Training to optimize a single metric leaves these models prone to fairness violations, especially when the populations of sub-groups in the training data are imbalanced. This work addresses the challenge of jointly optimizing fairness and predictive performance in the multi-class classification setting by introducing Fairness Optimized Reweighting via Meta-Learning (FORML), a training algorithm that balances fairness constraints and accuracy by jointly optimizing training sample weights and a neural network's parameters. The approach increases fairness by learning to weight each training datum's contribution to the loss according to its impact on reducing fairness violations, balancing the contributions from both over- and under-represented sub-groups. We empirically validate FORML on a range of benchmark and real-world classification datasets and show that our approach improves equality of opportunity fairness criteria over existing state-of-the-art reweighting methods by approximately 1% on image classification tasks and by approximately 5% on a face attribute prediction task.
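The five-section pipeline structure described in the forecasting-pipelines review maps naturally onto code. Below is a minimal sketch with one placeholder step per section; the candidate models, the lag-based features, and the top-2 averaging ensemble are illustrative stand-ins for the automated methods the review surveys.

```python
# Sketch of the (1) pre-processing -> (2) features -> (3) HPO ->
# (4) method selection -> (5) ensembling pipeline; all choices illustrative.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

def preprocess(y):                          # (1) data pre-processing
    y = np.asarray(y, dtype=float)
    return np.nan_to_num(y, nan=np.nanmean(y))

def make_features(y, lags=24):              # (2) feature engineering
    X = np.stack([y[i:len(y) - lags + i] for i in range(lags)], axis=1)
    return X, y[lags:]                      # row j: y[j..j+lags-1] -> y[j+lags]

def fit_forecasters(Xtr, ytr, Xval, yval):
    candidates = []
    for alpha in (0.1, 1.0, 10.0):          # (3) hyperparameter optimization
        candidates.append(Ridge(alpha=alpha).fit(Xtr, ytr))
    candidates.append(RandomForestRegressor(n_estimators=100).fit(Xtr, ytr))
    errors = [np.abs(m.predict(Xval) - yval).mean() for m in candidates]
    best = candidates[int(np.argmin(errors))]        # (4) method selection
    top2 = [candidates[i] for i in np.argsort(errors)[:2]]
    ensemble = lambda X: np.mean([m.predict(X) for m in top2], axis=0)  # (5)
    return best, ensemble
```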
This improvement is achieved without pre-processing data or post-processing model outputs, without learning an additional weighting function, and while maintaining accuracy on the original predictive metric.", "sentences": ["FORML: Learning to Reweight Data for Fairness.", "Deployed machine learning models are evaluated by multiple metrics beyond accuracy, such as fairness and robustness.", "However, such models are typically trained to minimize the average loss for a single metric, which is often a proxy for accuracy.", "Training to optimize a single metric leaves these models prone to fairness violations, especially when the populations of sub-groups in the training data are imbalanced.", "This work addresses the challenge of jointly optimizing fairness and predictive performance in the multi-class classification setting by introducing Fairness Optimized Reweighting via Meta-Learning (FORML), a training algorithm that balances fairness constraints and accuracy by jointly optimizing training sample weights and a neural network's parameters.", "The approach increases fairness by learning to weight each training datum's contribution to the loss according to its impact on reducing fairness violations, balancing the contributions from both over- and under-represented sub-groups.", "We empirically validate FORML on a range of benchmark and real-world classification datasets and show that our approach improves equality of opportunity fairness criteria over existing state-of-the-art reweighting methods by approximately 1% on image classification tasks and by approximately 5% on a face attribute prediction task.", "This improvement is achieved without pre-processing data or post-processing model outputs, without learning an additional weighting function, and while maintaining accuracy on the original predictive metric."]} {"id": "http://arxiv.org/abs/2202.01721", "title": "Variance-Optimal Augmentation Logging for Counterfactual Evaluation in Contextual Bandits.", "authors": "Aaron David Tucker, Thorsten Joachims", "abstract": "Methods for offline A/B testing and counterfactual learning are seeing rapid adoption in search and recommender systems, since they allow efficient reuse of existing log data. However, there are fundamental limits to using existing log data alone, since the counterfactual estimators that are commonly used in these methods can have large bias and large variance when the logging policy is very different from the target policy being evaluated. To overcome this limitation, we explore the question of how to design data-gathering policies that most effectively augment an existing dataset of bandit feedback with additional observations for both learning and evaluation. To this effect, this paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem.
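FORML itself meta-learns sample weights jointly with the network; as a much simpler stand-in, the sketch below adjusts per-group weights with an exponentiated-gradient-style update that up-weights groups whose loss is above average. All names are illustrative, and this alternating scheme is not the authors' algorithm.

```python
# Simplified alternating reweighting in the spirit of FORML's goal (balance
# per-group error via sample weights); not the paper's meta-learning procedure.
import torch

def reweight_step(model, loss_fn, x, y, group, weights, lr_w=0.1):
    """loss_fn must use reduction='none'; group holds integer sub-group ids;
    weights holds one weight per group; assumes every group appears in x."""
    with torch.no_grad():
        losses = loss_fn(model(x), y)            # per-sample losses
        group_loss = torch.stack([losses[group == g].mean()
                                  for g in range(int(group.max()) + 1)])
        # Up-weight groups with above-average loss.
        weights *= torch.exp(lr_w * (group_loss - group_loss.mean()))
        weights /= weights.sum()
    return weights

def weighted_loss(model, loss_fn, x, y, group, weights):
    losses = loss_fn(model(x), y)
    return (weights[group] * losses).sum() / weights[group].sum()
```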
We explore multiple approaches to computing MVAL policies efficiently, and find that they can be substantially more effective in decreasing the variance of an estimator than na\\\"ive approaches.", "sentences": ["Variance-Optimal Augmentation Logging for Counterfactual Evaluation in Contextual Bandits.", "Methods for offline A/B testing and counterfactual learning are seeing rapid adoption in search and recommender systems, since they allow efficient reuse of existing log data.", "However, there are fundamental limits to using existing log data alone, since the counterfactual estimators that are commonly used in these methods can have large bias and large variance when the logging policy is very different from the target policy being evaluated.", "To overcome this limitation, we explore the question of how to design data-gathering policies that most effectively augment an existing dataset of bandit feedback with additional observations for both learning and evaluation.", "To this effect, this paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem.", "We explore multiple approaches to computing MVAL policies efficiently, and find that they can be substantially more effective in decreasing the variance of an estimator than na\\\"ive approaches."]} {"id": "http://arxiv.org/abs/2202.01723", "title": "Systems Biology: Identifiability analysis and parameter identification via systems-biology informed neural networks.", "authors": "Mitchell Daneker, Zhen Zhang, George Em Karniadakis, Lu Lu", "abstract": "The dynamics of systems biological processes are usually modeled by a system of ordinary differential equations (ODEs) with many unknown parameters that need to be inferred from noisy and sparse measurements. Here, we introduce systems-biology informed neural networks for parameter estimation by incorporating the system of ODEs into the neural networks. To complete the workflow of system identification, we also describe structural and practical identifiability analysis to analyze the identifiability of parameters. 
We use the ultradian endocrine model for glucose-insulin interaction as an example to demonstrate all these methods and their implementation.", "sentences": ["Systems Biology: Identifiability analysis and parameter identification via systems-biology informed neural networks.", "The dynamics of systems biological processes are usually modeled by a system of ordinary differential equations (ODEs) with many unknown parameters that need to be inferred from noisy and sparse measurements.", "Here, we introduce systems-biology informed neural networks for parameter estimation by incorporating the system of ODEs into the neural networks.", "To complete the workflow of system identification, we also describe structural and practical identifiability analysis to analyze the identifiability of parameters.", "We use the ultradian endocrine model for glucose-insulin interaction as an example to demonstrate all these methods and their implementation."]} {"id": "http://arxiv.org/abs/2202.01725", "title": "RipsNet: a general architecture for fast and robust estimation of the persistent homology of point clouds.", "authors": "Thibault de Surrel, Felix Hensel, Mathieu Carri\u00e8re, Th\u00e9o Lacombe, Yuichi Ike, Hiroaki Kurihara, Marc Glisse, Fr\u00e9d\u00e9ric Chazal", "abstract": "The use of topological descriptors in modern machine learning applications, such as Persistence Diagrams (PDs) arising from Topological Data Analysis (TDA), has shown great potential in various domains. However, their practical use in applications is often hindered by two major limitations: the computational complexity required to compute such descriptors exactly, and their sensitivity to even low-level proportions of outliers. In this work, we propose to bypass these two burdens in a data-driven setting by entrusting the estimation of (vectorization of) PDs built on top of point clouds to a neural network architecture that we call RipsNet. Once trained on a given data set, RipsNet can estimate topological descriptors on test data very efficiently with generalization capacity. Furthermore, we prove that RipsNet is robust to input perturbations in terms of the 1-Wasserstein distance, a major improvement over the standard computation of PDs that only enjoys Hausdorff stability, enabling RipsNet to substantially outperform exactly-computed PDs in noisy settings. We showcase the use of RipsNet on both synthetic and real-world data.
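The core mechanism of systems-biology informed networks, penalizing the ODE residual of the network output alongside the data fit, is easy to sketch. The toy below uses logistic growth instead of the ultradian glucose-insulin model, with the rate r and capacity K standing in for the unknown parameters estimated jointly with the network.

```python
# Sketch of an ODE-constrained loss on a toy logistic-growth model
# dy/dt = r*y*(1 - y/K); r and K are trained jointly with the network.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
log_r = torch.zeros(1, requires_grad=True)   # learnable ODE parameters
log_K = torch.zeros(1, requires_grad=True)   # (log-space keeps them positive)

def ode_residual(t):
    t = t.requires_grad_(True)
    y = net(t)
    dy = torch.autograd.grad(y, t, torch.ones_like(y), create_graph=True)[0]
    r, K = log_r.exp(), log_K.exp()
    return dy - r * y * (1 - y / K)

def loss(t_data, y_data, t_colloc):
    data_term = ((net(t_data) - y_data) ** 2).mean()   # fit sparse measurements
    ode_term = (ode_residual(t_colloc) ** 2).mean()    # enforce the ODE
    return data_term + ode_term

opt = torch.optim.Adam(list(net.parameters()) + [log_r, log_K], lr=1e-3)
```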
Our open-source implementation is publicly available at https://github.com/hensel-f/ripsnet and will be included in the Gudhi library.", "sentences": ["RipsNet: a general architecture for fast and robust estimation of the persistent homology of point clouds.", "The use of topological descriptors in modern machine learning applications, such as Persistence Diagrams (PDs) arising from Topological Data Analysis (TDA), has shown great potential in various domains.", "However, their practical use in applications is often hindered by two major limitations: the computational complexity required to compute such descriptors exactly, and their sensitivity to even low-level proportions of outliers.", "In this work, we propose to bypass these two burdens in a data-driven setting by entrusting the estimation of (vectorization of) PDs built on top of point clouds to a neural network architecture that we call RipsNet.", "Once trained on a given data set, RipsNet can estimate topological descriptors on test data very efficiently with generalization capacity.", "Furthermore, we prove that RipsNet is robust to input perturbations in terms of the 1-Wasserstein distance, a major improvement over the standard computation of PDs that only enjoys Hausdorff stability, enabling RipsNet to substantially outperform exactly-computed PDs in noisy settings.", "We showcase the use of RipsNet on both synthetic and real-world data.", "Our open-source implementation is publicly available at https://github.com/hensel-f/ripsnet and will be included in the Gudhi library."]} {"id": "http://arxiv.org/abs/2202.01729", "title": "Can machines solve general queueing systems?.", "authors": "Eliran Sherzer, Arik Senderovich, Opher Baron, Dmitry Krass", "abstract": "In this paper, we analyze how well a machine can solve a general problem in queueing theory. To answer this question, we use a deep learning model to predict the stationary queue-length distribution of an $M/G/1$ queue (Poisson arrivals, general service times, one server). To the best of our knowledge, this is the first time a machine learning model is applied to a general queueing theory problem. We chose the $M/G/1$ queue for this paper because it lies \"on the cusp\" of the analytical frontier: on the one hand, an exact solution for this model is available, though it is both computationally and mathematically complex. On the other hand, the problem (specifically the service time distribution) is general. This allows us to compare the accuracy and efficiency of the deep learning approach to the analytical solutions. The two key challenges in applying machine learning to this problem are (1) generating a diverse set of training examples that provide a good representation of a \"generic\" positive-valued distribution, and (2) representing the continuous distribution of service times as an input. We show how we overcome these challenges. Our results show that our model is indeed able to predict the stationary behavior of the $M/G/1$ queue extremely accurately: the average value of our metric over the entire test set is $0.0009$. Moreover, our machine learning model is very efficient, computing very accurate stationary distributions in a fraction of a second (an approach based on simulation modeling would take much longer to converge). We also present a case-study that mimics a real-life setting and shows that our approach is more robust and provides more accurate solutions compared to the existing methods.
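RipsNet consumes a raw point cloud and outputs a vectorized topological descriptor, which architecturally calls for a permutation-invariant network. The sketch below shows that shape of model, per-point features pooled by a symmetric operation, with illustrative sizes and mean pooling; it is not the reference RipsNet architecture.

```python
# Hedged sketch of a permutation-invariant "point cloud -> vectorization" net.
import torch
import torch.nn as nn

class PointCloudVectorizer(nn.Module):
    def __init__(self, point_dim=2, out_dim=100):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(point_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 128), nn.ReLU())   # per-point
        self.rho = nn.Sequential(nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, out_dim))         # decoder

    def forward(self, pts):                 # pts: (batch, n_points, point_dim)
        return self.rho(self.phi(pts).mean(dim=1))   # symmetric aggregation
```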
This shows the promise of extending our approach beyond the analytically solvable systems (e.g., $G/G/1$ or $G/G/c$).", "sentences": ["Can machines solve general queueing systems?.", "In this paper, we analyze how well a machine can solve a general problem in queueing theory.", "To answer this question, we use a deep learning model to predict the stationary queue-length distribution of an $M/G/1$ queue (Poisson arrivals, general service times, one server).", "To the best of our knowledge, this is the first time a machine learning model is applied to a general queueing theory problem.", "We chose the $M/G/1$ queue for this paper because it lies \"on the cusp\" of the analytical frontier: on the one hand, an exact solution for this model is available, though it is both computationally and mathematically complex.", "On the other hand, the problem (specifically the service time distribution) is general.", "This allows us to compare the accuracy and efficiency of the deep learning approach to the analytical solutions.", "The two key challenges in applying machine learning to this problem are (1) generating a diverse set of training examples that provide a good representation of a \"generic\" positive-valued distribution, and (2) representing the continuous distribution of service times as an input.", "We show how we overcome these challenges.", "Our results show that our model is indeed able to predict the stationary behavior of the $M/G/1$ queue extremely accurately: the average value of our metric over the entire test set is $0.0009$.", "Moreover, our machine learning model is very efficient, computing very accurate stationary distributions in a fraction of a second (an approach based on simulation modeling would take much longer to converge).", "We also present a case-study that mimics a real-life setting and shows that our approach is more robust and provides more accurate solutions compared to the existing methods.", "This shows the promise of extending our approach beyond the analytically solvable systems (e.g., $G/G/1$ or $G/G/c$)."]} {"id": "http://arxiv.org/abs/2202.01741", "title": "How to Leverage Unlabeled Data in Offline Reinforcement Learning.", "authors": "Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Chelsea Finn, Sergey Levine", "abstract": "Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, it requires reward annotations for every transition. In many cases, labeling large datasets with rewards may be costly, especially if those rewards must be provided by human labelers, while collecting diverse unlabeled data might be comparatively inexpensive. How can we best leverage such unlabeled data in offline RL? One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data. In this paper, we find that, perhaps surprisingly, a much simpler method that simply applies zero rewards to unlabeled data leads to effective data sharing both in theory and in practice, without learning any reward model at all. While this approach might seem strange (and incorrect) at first, we provide extensive theoretical and empirical analysis that illustrates how it trades off reward bias, sample complexity and distributional shift, often leading to good results. We characterize conditions under which this simple strategy is effective, and further show that extending it with a simple reweighting approach can further alleviate the bias introduced by using incorrect reward labels.
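One way to appreciate the M/G/1 setup above is to see how labeled training examples for such a model could be generated: encode the service-time distribution (here by log-moments, an illustrative choice) and label it with a simulated stationary queue-length histogram obtained from the Lindley recursion. By PASTA, the distribution seen by Poisson arrivals matches the time-stationary one, so the M/M/1 geometric law gives a sanity check.

```python
# Sketch: simulate the stationary queue-length pmf of an M/G/1 queue.
import numpy as np

rng = np.random.default_rng(0)

def mg1_queue_length_pmf(lam, service_sampler, n=100_000, kmax=50):
    inter = rng.exponential(1.0 / lam, n)
    arrivals = np.cumsum(inter)
    services = service_sampler(n)
    w = np.zeros(n)                       # Lindley recursion: waiting in queue
    for i in range(1, n):
        w[i] = max(0.0, w[i - 1] + services[i - 1] - inter[i])
    departures = arrivals + w + services  # FIFO: nondecreasing
    # Number in system seen by arrival i = i minus customers already departed.
    ql = np.arange(n) - np.searchsorted(departures, arrivals, side="right")
    return np.bincount(ql, minlength=kmax)[:kmax] / n

def encode_service_dist(samples):
    # Network input features: log-moments of the service law (illustrative).
    return np.array([np.log(samples.mean()), np.log(samples.var() + 1e-12)])

# Sanity check against M/M/1 with rho = 0.5, where P(N = k) = (1 - rho) * rho**k:
pmf = mg1_queue_length_pmf(0.5, lambda n: rng.exponential(1.0, n))
print(pmf[:4])   # should be close to 0.5, 0.25, 0.125, 0.0625
```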
Our empirical evaluation confirms these findings in simulated robotic locomotion, navigation, and manipulation settings.", "sentences": ["How to Leverage Unlabeled Data in Offline Reinforcement Learning.", "Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, it requires reward annotations for every transition.", "In many cases, labeling large datasets with rewards may be costly, especially if those rewards must be provided by human labelers, while collecting diverse unlabeled data might be comparatively inexpensive.", "How can we best leverage such unlabeled data in offline RL?", "One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data.", "In this paper, we find that, perhaps surprisingly, a much simpler method that simply applies zero rewards to unlabeled data leads to effective data sharing both in theory and in practice, without learning any reward model at all.", "While this approach might seem strange (and incorrect) at first, we provide extensive theoretical and empirical analysis that illustrates how it trades off reward bias, sample complexity and distributional shift, often leading to good results.", "We characterize conditions under which this simple strategy is effective, and further show that extending it with a simple reweighting approach can further alleviate the bias introduced by using incorrect reward labels.", "Our empirical evaluation confirms these findings in simulated robotic locomotion, navigation, and manipulation settings."]} {"id": "http://arxiv.org/abs/2202.01748", "title": "Sequential Learning of the Topological Ordering for the Linear Non-Gaussian Acyclic Model with Parametric Noise.", "authors": "Gabriel Ruiz, Oscar Hernan Madrid Padilla, Qing Zhou", "abstract": "Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify \"what causes what?\" Contingent on assumptions, it is sometimes possible to identify an exact causal Directed Acyclic Graph (DAG), as opposed to a Markov equivalence class of graphs that gives ambiguity of causal directions. The focus of this paper is on one such case: a linear structural equation model with non-Gaussian noise, a model known as the Linear Non-Gaussian Acyclic Model (LiNGAM). Given a specified parametric noise model, we develop a novel sequential approach to estimate the causal ordering of a DAG. At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering. Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying causal DAG. We provide extensive numerical evidence to demonstrate that our sequential procedure is scalable to cases with possibly thousands of nodes and works well for high-dimensional data. 
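The zero-reward labeling scheme that the offline-RL paper above finds effective is strikingly simple to implement: assign reward 0 to every unlabeled transition and merge the datasets. Field names in this sketch are illustrative.

```python
# Merge a reward-labeled offline dataset with an unlabeled one by assigning
# zero rewards to the unlabeled transitions (no reward model is learned).
import numpy as np

def merge_with_zero_rewards(labeled, unlabeled):
    """Each dataset: dict with 'obs', 'action', 'next_obs', 'reward' arrays."""
    unlabeled = dict(unlabeled)
    unlabeled["reward"] = np.zeros(len(unlabeled["obs"]), dtype=np.float32)
    return {k: np.concatenate([labeled[k], unlabeled[k]])
            for k in ("obs", "action", "next_obs", "reward")}
```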
We also conduct an application to a single-cell gene expression dataset to demonstrate our estimation procedure.", "sentences": ["Sequential Learning of the Topological Ordering for the Linear Non-Gaussian Acyclic Model with Parametric Noise.", "Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify \"what causes what?\"", "Contingent on assumptions, it is sometimes possible to identify an exact causal Directed Acyclic Graph (DAG), as opposed to a Markov equivalence class of graphs that gives ambiguity of causal directions.", "The focus of this paper is on one such case: a linear structural equation model with non-Gaussian noise, a model known as the Linear Non-Gaussian Acyclic Model (LiNGAM).", "Given a specified parametric noise model, we develop a novel sequential approach to estimate the causal ordering of a DAG.", "At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering.", "Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying causal DAG.", "We provide extensive numerical evidence to demonstrate that our sequential procedure is scalable to cases with possibly thousands of nodes and works well for high-dimensional data.", "We also conduct an application to a single-cell gene expression dataset to demonstrate our estimation procedure."]} {"id": "http://arxiv.org/abs/2202.01752", "title": "Near-Optimal Learning of Extensive-Form Games with Imperfect Information.", "authors": "Yu Bai, Chi Jin, Song Mei, Tiancheng Yu", "abstract": "This paper resolves the open question of designing near-optimal algorithms for learning imperfect-information extensive-form games from bandit feedback. We present the first line of algorithms that require only $\\widetilde{\\mathcal{O}}((XA+YB)/\\varepsilon^2)$ episodes of play to find an $\\varepsilon$-approximate Nash equilibrium in two-player zero-sum games, where $X,Y$ are the number of information sets and $A,B$ are the number of actions for the two players. This improves upon the best known sample complexity of $\\widetilde{\\mathcal{O}}((X^2A+Y^2B)/\\varepsilon^2)$ by a factor of $\\widetilde{\\mathcal{O}}(\\max\\{X, Y\\})$, and matches the information-theoretic lower bound up to logarithmic factors. We achieve this sample complexity by two new algorithms: Balanced Online Mirror Descent, and Balanced Counterfactual Regret Minimization. Both algorithms rely on novel approaches of integrating \\emph{balanced exploration policies} into their classical counterparts. 
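The sequential ordering procedure in the LiNGAM paper above can be sketched in a few lines: at each step, regress every remaining variable on the already-ordered ones and append the variable whose residuals best fit the assumed noise model. The Laplace likelihood below is an illustrative parametric choice; the paper's actual scores are likelihood ratios under the user-specified noise model.

```python
# Sketch of sequential causal ordering via residual fit to a parametric
# (here Laplace, illustrative) noise model; assumes centered data.
import numpy as np

def laplace_loglik(r):
    b = np.mean(np.abs(r)) + 1e-12
    return -len(r) * np.log(2 * b) - np.sum(np.abs(r)) / b

def residuals(y, X):
    if X.shape[1] == 0:
        return y - y.mean()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def sequential_order(data):
    """data: (n_samples, p) matrix; returns an estimated causal ordering."""
    data = data - data.mean(axis=0)
    p = data.shape[1]
    order, remaining = [], list(range(p))
    while remaining:
        X = data[:, order]
        scores = [laplace_loglik(residuals(data[:, j], X)) for j in remaining]
        best = remaining[int(np.argmax(scores))]
        order.append(best)
        remaining.remove(best)
    return order
```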
We also extend our results to learning Coarse Correlated Equilibria in multi-player general-sum games.", "sentences": ["Near-Optimal Learning of Extensive-Form Games with Imperfect Information.", "This paper resolves the open question of designing near-optimal algorithms for learning imperfect-information extensive-form games from bandit feedback.", "We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games, where $X,Y$ are the number of information sets and $A,B$ are the number of actions for the two players.", "This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$, and matches the information-theoretic lower bound up to logarithmic factors.", "We achieve this sample complexity by two new algorithms: Balanced Online Mirror Descent, and Balanced Counterfactual Regret Minimization.", "Both algorithms rely on novel approaches of integrating \emph{balanced exploration policies} into their classical counterparts.", "We also extend our results to learning Coarse Correlated Equilibria in multi-player general-sum games."]} {"id": "http://arxiv.org/abs/2202.01758", "title": "PRUNIX: Non-Ideality Aware Convolutional Neural Network Pruning for Memristive Accelerators.", "authors": "Ali Alshaarawy, Amirali Amirsoleimani, Roman Genov", "abstract": "In this work, PRUNIX, a framework for training and pruning convolutional neural networks, is proposed for deployment on memristor crossbar based accelerators. PRUNIX takes into account the numerous non-ideal effects of memristor crossbars including weight quantization, state-drift, aging and stuck-at-faults. PRUNIX utilises a novel Group Sawtooth Regularization intended to improve non-ideality tolerance as well as sparsity, and a novel Adaptive Pruning Algorithm (APA) intended to minimise accuracy loss by considering the sensitivity of different layers of a CNN to pruning. We compare our regularization and pruning methods with other standards on multiple CNN architectures, and observe an improvement of 13% test accuracy when quantization and other non-ideal effects are accounted for with an overall sparsity of 85%, which is similar to other methods.", "sentences": ["PRUNIX: Non-Ideality Aware Convolutional Neural Network Pruning for Memristive Accelerators.", "In this work, PRUNIX, a framework for training and pruning convolutional neural networks, is proposed for deployment on memristor crossbar based accelerators.", "PRUNIX takes into account the numerous non-ideal effects of memristor crossbars including weight quantization, state-drift, aging and stuck-at-faults.", "PRUNIX utilises a novel Group Sawtooth Regularization intended to improve non-ideality tolerance as well as sparsity, and a novel Adaptive Pruning Algorithm (APA) intended to minimise accuracy loss by considering the sensitivity of different layers of a CNN to pruning.", "We compare our regularization and pruning methods with other standards on multiple CNN architectures, and observe an improvement of 13% test accuracy when quantization and other non-ideal effects are accounted for with an overall sparsity of 85%, which is similar to other methods."]} {"id": "http://arxiv.org/abs/2202.01762", "title": "Learning Physics through Images: An Application to Wind-Driven Spatial Patterns.", "authors": "M.
Giselle Fern\u00e1ndez-Godino, Donald D. Lucas, Qingkai Kong", "abstract": "For centuries, scientists have observed nature to understand the laws that govern the physical world. The traditional process of turning observations into physical understanding is slow. Imperfect models are constructed and tested to explain relationships in data. Powerful new algorithms are available that can enable computers to learn physics by observing images and videos. Inspired by this idea, instead of training machine learning models using physical quantities, we trained them using images, that is, pixel information. For this work, and as a proof of concept, the physics of interest are wind-driven spatial patterns. Examples of these phenomena include features in Aeolian dunes and the deposition of volcanic ash, wildfire smoke, and air pollution plumes. We assume that the spatial patterns were collected by an imaging device that records the magnitude of the logarithm of deposition as a red, green, blue (RGB) color image with channels containing values ranging from 0 to 255. In this paper, we explore deep convolutional neural network-based autoencoders to exploit relationships in wind-driven spatial patterns, which commonly occur in geosciences, and reduce their dimensionality. Reducing the data dimension size with an encoder allows us to train regression models linking geographic and meteorological scalar input quantities to the encoded space. Once this is achieved, full predictive spatial patterns are reconstructed using the decoder. We demonstrate this approach on images of spatial deposition from a pollution source, where the encoder compresses the dimensionality to 0.02% of the original size and the full predictive model performance on test data achieves an accuracy of 92%.", "sentences": ["Learning Physics through Images: An Application to Wind-Driven Spatial Patterns.", "For centuries, scientists have observed nature to understand the laws that govern the physical world.", "The traditional process of turning observations into physical understanding is slow.", "Imperfect models are constructed and tested to explain relationships in data.", "Powerful new algorithms are available that can enable computers to learn physics by observing images and videos.", "Inspired by this idea, instead of training machine learning models using physical quantities, we trained them using images, that is, pixel information.", "For this work, and as a proof of concept, the physics of interest are wind-driven spatial patterns.", "Examples of these phenomena include features in Aeolian dunes and the deposition of volcanic ash, wildfire smoke, and air pollution plumes.", "We assume that the spatial patterns were collected by an imaging device that records the magnitude of the logarithm of deposition as a red, green, blue (RGB) color image with channels containing values ranging from 0 to 255.", "In this paper, we explore deep convolutional neural network-based autoencoders to exploit relationships in wind-driven spatial patterns, which commonly occur in geosciences, and reduce their dimensionality.", "Reducing the data dimension size with an encoder allows us to train regression models linking geographic and meteorological scalar input quantities to the encoded space.", "Once this is achieved, full predictive spatial patterns are reconstructed using the decoder.", "We demonstrate this approach on images of spatial deposition from a pollution source, where the encoder compresses the dimensionality to 0.02% of the original size 
and the full predictive model performance on test data achieves an accuracy of 92%."]} {"id": "http://arxiv.org/abs/2202.01764", "title": "JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension.", "authors": "ByungHoon So, Kyuhong Byun, Kyungwon Kang, Seongjin Cho", "abstract": "Question Answering (QA) is a task in which a machine understands a given document and a question to find an answer. Despite impressive progress in the NLP area, QA is still a challenging problem, especially for non-English languages due to the lack of annotated datasets. In this paper, we present the Japanese Question Answering Dataset, JaQuAD, which is annotated by humans. JaQuAD consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles. We finetuned a baseline model which achieves 78.92% for F1 score and 63.38% for EM on the test set. The dataset and our experiments are available at https://github.com/SkelterLabsInc/JaQuAD.", "sentences": ["JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension.", "Question Answering (QA) is a task in which a machine understands a given document and a question to find an answer.", "Despite impressive progress in the NLP area, QA is still a challenging problem, especially for non-English languages due to the lack of annotated datasets.", "In this paper, we present the Japanese Question Answering Dataset, JaQuAD, which is annotated by humans.", "JaQuAD consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles.", "We finetuned a baseline model which achieves 78.92% for F1 score and 63.38% for EM on the test set.", "The dataset and our experiments are available at https://github.com/SkelterLabsInc/JaQuAD."]} {"id": "http://arxiv.org/abs/2202.01765", "title": "Who will Leave a Pediatric Weight Management Program and When? -- A machine learning approach for predicting attrition patterns.", "authors": "Hamed Fayyaz, Thao-Ly T. Phan, H. Timothy Bunnell, Rahmatollah Beheshti", "abstract": "Childhood obesity is a major public health concern. Multidisciplinary pediatric weight management programs are considered standard treatment for children with obesity and severe obesity who are not able to be successfully managed in the primary care setting; however, high drop-out rates (referred to as attrition) are a major hurdle in delivering successful interventions. Predicting attrition patterns can help providers reduce the attrition rates. Previous work has mainly focused on finding static predictors of attrition using statistical analysis methods. In this study, we present a machine learning model to predict (a) the likelihood of attrition, and (b) the change in body-mass index (BMI) percentile of children, at different time points after joining a weight management program. We use a five-year dataset containing the information related to around 4,550 children that we have compiled using data from the Nemours Pediatric Weight Management program. Our models show strong prediction performance as determined by high AUROC scores across different tasks (average AUROC of 0.75 for predicting attrition, and 0.73 for predicting weight outcomes). Additionally, we report the top features predicting attrition and weight outcomes in a series of explanatory experiments.", "sentences": ["Who will Leave a Pediatric Weight Management Program and When?
-- A machine learning approach for predicting attrition patterns.", "Childhood obesity is a major public health concern.", "Multidisciplinary pediatric weight management programs are considered standard treatment for children with obesity and severe obesity who are not able to be successfully managed in the primary care setting; however, high drop-out rates (referred to as attrition) are a major hurdle in delivering successful interventions.", "Predicting attrition patterns can help providers reduce the attrition rates.", "Previous work has mainly focused on finding static predictors of attrition using statistical analysis methods.", "In this study, we present a machine learning model to predict (a) the likelihood of attrition, and (b) the change in body-mass index (BMI) percentile of children, at different time points after joining a weight management program.", "We use a five-year dataset containing the information related to around 4,550 children that we have compiled using data from the Nemours Pediatric Weight Management program.", "Our models show strong prediction performance as determined by high AUROC scores across different tasks (average AUROC of 0.75 for predicting attrition, and 0.73 for predicting weight outcomes).", "Additionally, we report the top features predicting attrition and weight outcomes in a series of explanatory experiments."]} {"id": "http://arxiv.org/abs/2202.01770", "title": "Exploring Multi-physics with Extremely Weak Supervision.", "authors": "Shihang Feng, Peng Jin, Yinpeng Chen, Xitong Zhang, Zicheng Liu, Youzuo Lin", "abstract": "Multi-physical inversion plays a critical role in geophysics. It has been widely used to infer various physical properties (such as velocity and conductivity) simultaneously. Among those inversion problems, some are explicitly governed by partial differential equations (PDEs), while others are not. Without explicit governing equations, conventional multi-physical inversion techniques will not be feasible and data-driven inversion requires expensive full labels. To overcome this issue, we develop a new data-driven multi-physics inversion technique with extremely weak supervision. Our key finding is that the pseudo labels can be constructed by learning the local relationship among geophysical properties at very sparse locations. We explore a multi-physics inversion problem from two distinct measurements (seismic and EM data) to three geophysical properties (velocity, conductivity, and CO$_2$ saturation). Our results show that we are able to invert for properties without explicit governing equations.
Moreover, the label data on three geophysical properties can be significantly reduced, by a factor of 50 (from 100 down to only 2 locations).", "sentences": ["Exploring Multi-physics with Extremely Weak Supervision.", "Multi-physical inversion plays a critical role in geophysics.", "It has been widely used to infer various physical properties (such as velocity and conductivity) simultaneously.", "Among those inversion problems, some are explicitly governed by partial differential equations (PDEs), while others are not.", "Without explicit governing equations, conventional multi-physical inversion techniques will not be feasible and data-driven inversion requires expensive full labels.", "To overcome this issue, we develop a new data-driven multi-physics inversion technique with extremely weak supervision.", "Our key finding is that the pseudo labels can be constructed by learning the local relationship among geophysical properties at very sparse locations.", "We explore a multi-physics inversion problem from two distinct measurements (seismic and EM data) to three geophysical properties (velocity, conductivity, and CO$_2$ saturation).", "Our results show that we are able to invert for properties without explicit governing equations.", "Moreover, the label data on three geophysical properties can be significantly reduced, by a factor of 50 (from 100 down to only 2 locations)."]} {"id": "http://arxiv.org/abs/2202.01771", "title": "Pre-Trained Language Models for Interactive Decision-Making.", "authors": "Shuang Li, Xavier Puig, Yilun Du, Clinton Wang, Ekin Akyurek, Antonio Torralba, Jacob Andreas, Igor Mordatch", "abstract": "Language model (LM) pre-training has proven useful for a wide variety of language processing tasks, but can such pre-training be leveraged for more general machine learning problems? We investigate the effectiveness of language modeling to scaffold learning and generalization in autonomous decision-making. We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings, and translated into actions using a policy network initialized with a pre-trained transformer LM. We demonstrate that this framework enables effective combinatorial generalization across different environments, such as VirtualHome and BabyAI. In particular, for test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6% in VirtualHome. We hypothesize and investigate three possible factors underlying the effectiveness of LM-based policy initialization. We find that sequential representations (vs. fixed-dimensional feature vectors) and the LM objective (not just the transformer architecture) are both important for generalization. Surprisingly, however, the format of the policy input encoding (e.g. as a natural language string vs. an arbitrary sequential encoding) has little influence.
Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing.", "sentences": ["Pre-Trained Language Models for Interactive Decision-Making.", "Language model (LM) pre-training has proven useful for a wide variety of language processing tasks, but can such pre-training be leveraged for more general machine learning problems?", "We investigate the effectiveness of language modeling to scaffold learning and generalization in autonomous decision-making.", "We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings, and translated into actions using a policy network initialized with a pre-trained transformer LM.", "We demonstrate that this framework enables effective combinatorial generalization across different environments, such as VirtualHome and BabyAI.", "In particular, for test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6% in VirtualHome.", "We hypothesize and investigate three possible factors underlying the effectiveness of LM-based policy initialization.", "We find that sequential representations (vs. fixed-dimensional feature vectors) and the LM objective (not just the transformer architecture) are both important for generalization.", "Surprisingly, however, the format of the policy input encoding (e.g. as a natural language string vs. an arbitrary sequential encoding) has little influence.", "Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing."]} {"id": "http://arxiv.org/abs/2202.01773", "title": "Multiclass learning with margin: exponential rates with no bias-variance trade-off.", "authors": "Stefano Vigogna, Giacomo Meanti, Ernesto De Vito, Lorenzo Rosasco", "abstract": "We study the behavior of error bounds for multiclass classification under suitable margin conditions. For a wide variety of methods we prove that the classification error under a hard-margin condition decreases exponentially fast without any bias-variance trade-off. Different convergence rates can be obtained in correspondence of different margin assumptions. With a self-contained and instructive analysis we are able to generalize known results from the binary to the multiclass setting.", "sentences": ["Multiclass learning with margin: exponential rates with no bias-variance trade-off.", "We study the behavior of error bounds for multiclass classification under suitable margin conditions.", "For a wide variety of methods we prove that the classification error under a hard-margin condition decreases exponentially fast without any bias-variance trade-off.", "Different convergence rates can be obtained in correspondence of different margin assumptions.", "With a self-contained and instructive analysis we are able to generalize known results from the binary to the multiclass setting."]} {"id": "http://arxiv.org/abs/1909.01132", "title": "PageRank algorithm for Directed Hypergraph.", "authors": "Loc Tran, Tho Quan, An Mai", "abstract": "During the last two decades, the World Wide Web's link structure has commonly been modeled as a directed graph.
In this paper, we will model the World Wide Web's link structure as a directed hypergraph. Moreover, we will develop the PageRank algorithm for this directed hypergraph. Due to the lack of World Wide Web directed hypergraph datasets, we will apply the PageRank algorithm to a metabolic network, which is itself a directed hypergraph. The experiments show that our novel PageRank algorithm is successfully applied to this metabolic network.", "sentences": ["PageRank algorithm for Directed Hypergraph.", "During the last two decades, the World Wide Web's link structure has commonly been modeled as a directed graph.", "In this paper, we will model the World Wide Web's link structure as a directed hypergraph.", "Moreover, we will develop the PageRank algorithm for this directed hypergraph.", "Due to the lack of World Wide Web directed hypergraph datasets, we will apply the PageRank algorithm to a metabolic network, which is itself a directed hypergraph.", "The experiments show that our novel PageRank algorithm is successfully applied to this metabolic network."]} {"id": "http://arxiv.org/abs/2002.11875", "title": "Optimality and Stability in Non-Convex Smooth Games.", "authors": "Guojun Zhang, Pascal Poupart, Yaoliang Yu", "abstract": "Convergence to a saddle point for convex-concave functions has been studied for decades, while recent years have seen a surge of interest in non-convex (zero-sum) smooth games, motivated by their recent wide applications. It remains an intriguing research challenge how local optimal points are defined and which algorithm can converge to such points. An interesting concept is known as the local minimax point, which strongly correlates with the widely-known gradient descent ascent algorithm. This paper aims to provide a comprehensive analysis of local minimax points, such as their relation with other solution concepts and their optimality conditions. We find that local saddle points can be regarded as a special type of local minimax points, called uniformly local minimax points, under mild continuity assumptions. In (non-convex) quadratic games, we show that local minimax points are (in some sense) equivalent to global minimax points. Finally, we study the stability of gradient algorithms near local minimax points. Although gradient algorithms can converge to local/global minimax points in the non-degenerate case, they would often fail in general cases.
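For the directed-hypergraph PageRank abstract above, a power-iteration sketch is given below: each hyperedge carries mass from its tail nodes to its head nodes. The particular transition rule (uniform split over an edge's heads, then column normalization, with teleportation for dangling nodes) is one natural construction and not necessarily the paper's.

```python
# PageRank-style power iteration on a directed hypergraph.
import numpy as np

def hyper_pagerank(n, hyperedges, damping=0.85, iters=100):
    """hyperedges: list of (tails, heads) pairs of node-index lists."""
    P = np.zeros((n, n))                      # column-stochastic transition
    for tails, heads in hyperedges:
        for t in tails:
            P[heads, t] += 1.0 / len(heads)   # split mass over the edge's heads
    out = P.sum(axis=0)
    P[:, out > 0] /= out[out > 0]             # normalize outgoing mass per node
    P[:, out == 0] = 1.0 / n                  # dangling nodes teleport uniformly
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = damping * P @ r + (1 - damping) / n
    return r

# Example: edges ({0,1} -> {2}) and ({2} -> {0}) on 3 nodes.
print(hyper_pagerank(3, [([0, 1], [2]), ([2], [0])]))
```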
This implies the necessity of either novel algorithms or concepts beyond saddle points and minimax points in non-convex smooth games.", "sentences": ["Optimality and Stability in Non-Convex Smooth Games.", "Convergence to a saddle point for convex-concave functions has been studied for decades, while recent years have seen a surge of interest in non-convex (zero-sum) smooth games, motivated by their recent wide applications.", "It remains an intriguing research challenge how local optimal points are defined and which algorithm can converge to such points.", "An interesting concept is known as the local minimax point, which strongly correlates with the widely-known gradient descent ascent algorithm.", "This paper aims to provide a comprehensive analysis of local minimax points, such as their relation with other solution concepts and their optimality conditions.", "We find that local saddle points can be regarded as a special type of local minimax points, called uniformly local minimax points, under mild continuity assumptions.", "In (non-convex) quadratic games, we show that local minimax points are (in some sense) equivalent to global minimax points.", "Finally, we study the stability of gradient algorithms near local minimax points.", "Although gradient algorithms can converge to local/global minimax points in the non-degenerate case, they would often fail in general cases.", "This implies the necessity of either novel algorithms or concepts beyond saddle points and minimax points in non-convex smooth games."]} {"id": "http://arxiv.org/abs/2007.02411", "title": "Assessing External Validity Over Worst-case Subpopulations.", "authors": "Sookyo Jeong, Hongseok Namkoong", "abstract": "Study populations are typically sampled from limited points in space and time, and marginalized groups are underrepresented. To assess the external validity of randomized and observational studies, we propose and evaluate the worst-case treatment effect (WTE) across all subpopulations of a given size, which guarantees positive findings remain valid over subpopulations. We develop a semiparametrically efficient estimator for the WTE that analyzes the external validity of the augmented inverse propensity weighted estimator for the average treatment effect. Our cross-fitting procedure leverages flexible nonparametric and machine learning-based estimates of nuisance parameters and is a regular root-$n$ estimator even when nuisance estimates converge more slowly.
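The instability of gradient methods that the non-convex-games abstract alludes to is easy to reproduce numerically: simultaneous gradient descent ascent on the bilinear game f(x, y) = xy spirals away from the unique minimax point at the origin.

```python
# Simultaneous gradient descent ascent on f(x, y) = x*y diverges from (0, 0).
import numpy as np

def gda(x, y, lr=0.1, steps=200):
    traj = []
    for _ in range(steps):
        gx, gy = y, x                     # df/dx = y, df/dy = x
        x, y = x - lr * gx, y + lr * gy   # descent in x, ascent in y
        traj.append((x, y))
    return np.array(traj)

traj = gda(1.0, 1.0)
print(np.linalg.norm(traj[0]), np.linalg.norm(traj[-1]))  # norm grows each step
```

Each step multiplies the squared distance to the origin by (1 + lr**2), so the iterates spiral outward for any positive step size, which is exactly the kind of failure mode that motivates studying stability near local minimax points.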
On real examples where external validity is of core concern, our proposed framework guards against brittle findings that are invalidated by unanticipated population shifts.", "sentences": ["Assessing External Validity Over Worst-case Subpopulations.", "Study populations are typically sampled from limited points in space and time, and marginalized groups are underrepresented.", "To assess the external validity of randomized and observational studies, we propose and evaluate the worst-case treatment effect (WTE) across all subpopulations of a given size, which guarantees positive findings remain valid over subpopulations.", "We develop a semiparametrically efficient estimator for the WTE that analyzes the external validity of the augmented inverse propensity weighted estimator for the average treatment effect.", "Our cross-fitting procedure leverages flexible nonparametric and machine learning-based estimates of nuisance parameters and is a regular root-$n$ estimator even when nuisance estimates converge more slowly.", "On real examples where external validity is of core concern, our proposed framework guards against brittle findings that are invalidated by unanticipated population shifts."]} {"id": "http://arxiv.org/abs/2007.10653", "title": "Accounting for Unobserved Confounding in Domain Generalization.", "authors": "Alexis Bellot, Mihaela van der Schaar", "abstract": "This paper investigates the problem of learning robust, generalizable prediction models from a combination of multiple datasets and qualitative assumptions about the underlying data-generating model. Part of the challenge of learning robust models lies in the influence of unobserved confounders that void many of the invariances and principles of minimum error presently used for this problem. Our approach is to define a different invariance property of causal solutions in the presence of unobserved confounders which, through a relaxation of this invariance, can be connected with an explicit distributionally robust optimization problem over a set of affine combinations of data distributions. Concretely, our objective takes the form of a standard loss, plus a regularization term that encourages partial equality of error derivatives with respect to model parameters.
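The objective just described, a standard loss plus a penalty encouraging partial equality of error derivatives across environments, can be sketched as follows. This is a hedged illustration in PyTorch with an assumed per-environment batch structure, not the authors' exact objective:

```python
import torch

def derivative_matching_loss(model, env_batches, lam=1.0):
    """Average per-environment loss, plus a penalty on the squared deviation
    of each environment's loss gradient from the mean gradient.
    env_batches is an assumed list of (x, y) tensor pairs, one per environment."""
    params = [p for p in model.parameters() if p.requires_grad]
    losses, grads = [], []
    for x, y in env_batches:
        loss = torch.nn.functional.mse_loss(model(x), y)
        g = torch.autograd.grad(loss, params, create_graph=True)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
        losses.append(loss)
    G = torch.stack(grads)
    penalty = ((G - G.mean(dim=0, keepdim=True)) ** 2).sum()
    return torch.stack(losses).mean() + lam * penalty
```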
We demonstrate the empirical performance of our approach on healthcare data from different modalities, including image, speech and tabular data.", "sentences": ["Accounting for Unobserved Confounding in Domain Generalization.", "This paper investigates the problem of learning robust, generalizable prediction models from a combination of multiple datasets and qualitative assumptions about the underlying data-generating model.", "Part of the challenge of learning robust models lies in the influence of unobserved confounders that void many of the invariances and principles of minimum error presently used for this problem.", "Our approach is to define a different invariance property of causal solutions in the presence of unobserved confounders which, through a relaxation of this invariance, can be connected with an explicit distributionally robust optimization problem over a set of affine combinations of data distributions.", "Concretely, our objective takes the form of a standard loss, plus a regularization term that encourages partial equality of error derivatives with respect to model parameters.", "We demonstrate the empirical performance of our approach on healthcare data from different modalities, including image, speech and tabular data."]} {"id": "http://arxiv.org/abs/2008.03626", "title": "Directed hypergraph neural network.", "authors": "Loc Hoang Tran, Linh Hoang Tran", "abstract": "To deal with irregular data structures, graph convolutional neural networks have been developed by many data scientists. However, these efforts have concentrated primarily on developing deep neural network methods for undirected graphs. In this paper, we will present a novel neural network method for directed hypergraphs. In other words, we will develop not only a novel directed hypergraph neural network method but also a novel directed hypergraph based semi-supervised learning method. These methods are employed to solve the node classification task. The two datasets that are used in the experiments are the Cora and the CiteSeer datasets.
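To make the directed hypergraph structure underlying these methods concrete, here is a hypothetical toy example: hyperedges are (tail set -> head set) pairs, and a damped random walk over them yields PageRank-style scores (a sketch, not either paper's exact formulation; the edge list is invented for illustration):

```python
import numpy as np

# A directed hypergraph as a list of (tail_set, head_set) hyperedges.
edges = [({0, 1}, {2}), ({2}, {0, 3}), ({3}, {1})]
n = 4

# Random walk: from node u, pick uniformly a hyperedge whose tail contains u,
# then move to a uniform node of that hyperedge's head.
P = np.zeros((n, n))
for tail, head in edges:
    for u in tail:
        out_degree = sum(u in t for t, _ in edges)  # hyperedges leaving u
        for v in head:
            P[u, v] += 1.0 / (out_degree * len(head))

d = 0.85                       # damping factor, as in classic PageRank
r = np.full(n, 1.0 / n)
for _ in range(100):           # power iteration
    r = (1 - d) / n + d * (r @ P)
print(r / r.sum())
```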
Among the classic directed graph based semi-supervised learning method, the novel directed hypergraph based semi-supervised learning method, and the novel directed hypergraph neural network method utilized to solve this node classification task, we find that the novel directed hypergraph neural network achieves the highest accuracies.", "sentences": ["Directed hypergraph neural network.", "To deal with irregular data structures, graph convolutional neural networks have been developed by many data scientists.", "However, these efforts have concentrated primarily on developing deep neural network methods for undirected graphs.", "In this paper, we will present a novel neural network method for directed hypergraphs.", "In other words, we will develop not only a novel directed hypergraph neural network method but also a novel directed hypergraph based semi-supervised learning method.", "These methods are employed to solve the node classification task.", "The two datasets that are used in the experiments are the Cora and the CiteSeer datasets.", "Among the classic directed graph based semi-supervised learning method, the novel directed hypergraph based semi-supervised learning method, and the novel directed hypergraph neural network method utilized to solve this node classification task, we find that the novel directed hypergraph neural network achieves the highest accuracies."]} {"id": "http://arxiv.org/abs/2008.08733", "title": "Optimal Network Compression.", "authors": "Hamed Amini, Zachary Feinstein", "abstract": "This paper introduces a formulation of the optimal network compression problem for financial systems. This general formulation is presented for different levels of network compression or rerouting allowed from the initial interbank network. We prove that this problem is, generically, NP-hard. We focus on objective functions generated by systemic risk measures under systematic shocks to the financial network. We conclude by studying the optimal compression problem for specific networks; this permits us to study the so-called robust fragility of certain network topologies more generally as well as the potential benefits and costs of network compression.
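One simple instance of compressing an interbank network is bilateral netting, which cancels reciprocal obligations; a minimal sketch with an invented liability matrix (the paper's formulation is far more general):

```python
import numpy as np

def bilateral_netting(L):
    """Compress a liability matrix L (L[i, j] = amount i owes j) by netting
    reciprocal obligations: only max(L[i, j] - L[j, i], 0) remains."""
    return np.maximum(L - L.T, 0.0)

L = np.array([[0., 5., 0.],
              [3., 0., 2.],
              [1., 0., 0.]])
print(bilateral_netting(L))  # the 5/3 pair nets to a single 2 owed by node 0 to node 1
```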
In particular, under systematic shocks and heterogeneous financial networks the typical heuristics of robust fragility no longer hold generally.", "sentences": ["Optimal Network Compression.", "This paper introduces a formulation of the optimal network compression problem for financial systems.", "This general formulation is presented for different levels of network compression or rerouting allowed from the initial interbank network.", "We prove that this problem is, generically, NP-hard.", "We focus on objective functions generated by systemic risk measures under systematic shocks to the financial network.", "We conclude by studying the optimal compression problem for specific networks; this permits us to study the so-called robust fragility of certain network topologies more generally as well as the potential benefits and costs of network compression.", "In particular, under systematic shocks and heterogeneous financial networks the typical heuristics of robust fragility no longer hold generally."]} {"id": "http://arxiv.org/abs/2010.13972", "title": "GPUTreeShap: Massively Parallel Exact Calculation of SHAP Scores for Tree Ensembles.", "authors": "Rory Mitchell, Eibe Frank, Geoffrey Holmes", "abstract": "SHAP (SHapley Additive exPlanation) values provide a game theoretic interpretation of the predictions of machine learning models based on Shapley values. While exact calculation of SHAP values is computationally intractable in general, a recursive polynomial-time algorithm called TreeShap is available for decision tree models. However, despite its polynomial time complexity, TreeShap can become a significant bottleneck in practical machine learning pipelines when applied to large decision tree ensembles. Unfortunately, the complicated TreeShap algorithm is difficult to map to hardware accelerators such as GPUs. In this work, we present GPUTreeShap, a reformulated TreeShap algorithm suitable for massively parallel computation on graphics processing units. Our approach first preprocesses each decision tree to isolate variable sized sub-problems from the original recursive algorithm, then solves a bin packing problem, and finally maps sub-problems to single-instruction, multiple-thread (SIMT) tasks for parallel execution with specialised hardware instructions. With a single NVIDIA Tesla V100-32 GPU, we achieve speedups of up to 19x for SHAP values, and speedups of up to 340x for SHAP interaction values, over a state-of-the-art multi-core CPU implementation executed on two 20-core Xeon E5-2698 v4 2.2 GHz CPUs. 
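As a usage illustration (assuming the xgboost package is installed; this is not code from the paper), TreeShap values for a boosted ensemble can be obtained via pred_contribs, which XGBoost dispatches to GPUTreeShap when a GPU build and GPU predictor are configured:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(float)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic", "max_depth": 3},
                    dtrain, num_boost_round=20)

# Per-feature SHAP values for each row; the last column is the bias term.
phi = booster.predict(xgb.DMatrix(X), pred_contribs=True)
print(phi.shape)  # (200, 6): 5 features + bias
```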
We also experiment with multi-GPU computing using eight V100 GPUs, demonstrating throughput of 1.2M rows per second -- equivalent CPU-based performance is estimated to require 6850 CPU cores.", "sentences": ["GPUTreeShap: Massively Parallel Exact Calculation of SHAP Scores for Tree Ensembles.", "SHAP (SHapley Additive exPlanation) values provide a game theoretic interpretation of the predictions of machine learning models based on Shapley values.", "While exact calculation of SHAP values is computationally intractable in general, a recursive polynomial-time algorithm called TreeShap is available for decision tree models.", "However, despite its polynomial time complexity, TreeShap can become a significant bottleneck in practical machine learning pipelines when applied to large decision tree ensembles.", "Unfortunately, the complicated TreeShap algorithm is difficult to map to hardware accelerators such as GPUs.", "In this work, we present GPUTreeShap, a reformulated TreeShap algorithm suitable for massively parallel computation on graphics processing units.", "Our approach first preprocesses each decision tree to isolate variable sized sub-problems from the original recursive algorithm, then solves a bin packing problem, and finally maps sub-problems to single-instruction, multiple-thread (SIMT) tasks for parallel execution with specialised hardware instructions.", "With a single NVIDIA Tesla V100-32 GPU, we achieve speedups of up to 19x for SHAP values, and speedups of up to 340x for SHAP interaction values, over a state-of-the-art multi-core CPU implementation executed on two 20-core Xeon E5-2698 v4 2.2 GHz CPUs.", "We also experiment with multi-GPU computing using eight V100 GPUs, demonstrating throughput of 1.2M rows per second -- equivalent CPU-based performance is estimated to require 6850 CPU cores."]} {"id": "http://arxiv.org/abs/2102.01934", "title": "Noise-robust classification with hypergraph neural network.", "authors": "Nguyen Trinh Vu Dang, Loc Tran, Linh Tran", "abstract": "This paper presents a novel version of the hypergraph neural network method. This method is utilized to solve the noisy label learning problem. First, we apply the PCA dimensionality reduction technique to the feature matrices of the image datasets in order to reduce the \"noise\" and the redundant features, and to reduce the runtime of constructing the hypergraph for the hypergraph neural network method. Then, the classic graph-based semi-supervised learning method, the classic hypergraph based semi-supervised learning method, the graph neural network, the hypergraph neural network, and our proposed hypergraph neural network are employed to solve the noisy label learning problem. The accuracies of these five methods are evaluated and compared. Experimental results show that the hypergraph neural network methods achieve the best performance when the noise level increases.
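The PCA preprocessing step described above might look as follows (a sketch with a placeholder feature matrix, assuming scikit-learn):

```python
import numpy as np
from sklearn.decomposition import PCA

# Project image features onto their leading principal components before
# building the hypergraph, discarding low-variance ("noisy") directions.
X = np.random.rand(1000, 784)           # placeholder feature matrix
X_reduced = PCA(n_components=50).fit_transform(X)
print(X_reduced.shape)                  # (1000, 50)
```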
Moreover, the hypergraph neural network methods are at least as good as the graph neural network.", "sentences": ["Noise-robust classification with hypergraph neural network.", "This paper presents a novel version of the hypergraph neural network method.", "This method is utilized to solve the noisy label learning problem.", "First, we apply the PCA dimensionality reduction technique to the feature matrices of the image datasets in order to reduce the \"noise\" and the redundant features, and to reduce the runtime of constructing the hypergraph for the hypergraph neural network method.", "Then, the classic graph-based semi-supervised learning method, the classic hypergraph based semi-supervised learning method, the graph neural network, the hypergraph neural network, and our proposed hypergraph neural network are employed to solve the noisy label learning problem.", "The accuracies of these five methods are evaluated and compared.", "Experimental results show that the hypergraph neural network methods achieve the best performance when the noise level increases.", "Moreover, the hypergraph neural network methods are at least as good as the graph neural network."]} {"id": "http://arxiv.org/abs/2102.02705", "title": "EFloat: Entropy-coded Floating Point Format for Compressing Vector Embedding Models.", "authors": "Rajesh Bordawekar, Bulent Abali, Ming-Hung Chen", "abstract": "In a large class of deep learning models, including vector embedding models such as word and database embeddings, we observe that floating point exponent values cluster around a few unique values, permitting entropy based data compression. Entropy coding compresses fixed-length values with variable-length codes, encoding most probable values with fewer bits. We propose the EFloat compressed floating point number format that uses a variable field boundary between the exponent and significand fields. EFloat uses entropy coding on exponent values and signs to minimize the average width of the exponent and sign fields, while preserving the original FP32 exponent range unchanged. Saved bits become part of the significand field, increasing the EFloat numeric precision by 4.3 bits on average compared to other reduced-precision floating point formats. EFloat makes 8-bit and even smaller floats practical without sacrificing the exponent range of a 32-bit floating point representation. We currently use the EFloat format for saving memory capacity and bandwidth consumption of large vector embedding models such as those used for database embeddings. Using the RMS error as a metric, we demonstrate that EFloat provides higher accuracy than other floating point formats with equal bit budget. The EF12 format with 12-bit budget has less end-to-end application error than the 16-bit BFloat16. EF16 with 16-bit budget has an RMS error 17 to 35 times smaller than the BF16 RMS error for a diverse set of embedding models.
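The empirical observation motivating EFloat, that FP32 exponents cluster and therefore carry little entropy, can be checked with a short diagnostic (a sketch, not the codec itself):

```python
import numpy as np

def exponent_entropy(x):
    """Empirical entropy (bits) of the FP32 exponent field of an array.
    Clustered exponents mean low entropy and therefore short entropy codes."""
    bits = x.astype(np.float32).view(np.uint32)
    exponents = (bits >> 23) & 0xFF
    _, counts = np.unique(exponents, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

emb = np.random.randn(10000).astype(np.float32) * 0.01  # embedding-like values
print(exponent_entropy(emb))  # far below the 8 bits a fixed-width field spends
```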
When making similarity and dissimilarity queries, using the NDCG ranking metric, EFloat matches the result quality of prior floating point representations with larger bit budgets.", "sentences": ["EFloat: Entropy-coded Floating Point Format for Compressing Vector Embedding Models.", "In a large class of deep learning models, including vector embedding models such as word and database embeddings, we observe that floating point exponent values cluster around a few unique values, permitting entropy based data compression.", "Entropy coding compresses fixed-length values with variable-length codes, encoding most probable values with fewer bits.", "We propose the EFloat compressed floating point number format that uses a variable field boundary between the exponent and significand fields.", "EFloat uses entropy coding on exponent values and signs to minimize the average width of the exponent and sign fields, while preserving the original FP32 exponent range unchanged.", "Saved bits become part of the significand field, increasing the EFloat numeric precision by 4.3 bits on average compared to other reduced-precision floating point formats.", "EFloat makes 8-bit and even smaller floats practical without sacrificing the exponent range of a 32-bit floating point representation.", "We currently use the EFloat format for saving memory capacity and bandwidth consumption of large vector embedding models such as those used for database embeddings.", "Using the RMS error as a metric, we demonstrate that EFloat provides higher accuracy than other floating point formats with equal bit budget.", "The EF12 format with 12-bit budget has less end-to-end application error than the 16-bit BFloat16.", "EF16 with 16-bit budget has an RMS error 17 to 35 times smaller than the BF16 RMS error for a diverse set of embedding models.", "When making similarity and dissimilarity queries, using the NDCG ranking metric, EFloat matches the result quality of prior floating point representations with larger bit budgets."]} {"id": "http://arxiv.org/abs/2102.07389", "title": "And/or trade-off in artificial neurons: impact on adversarial robustness.", "authors": "Alessandro Fontana", "abstract": "Since its discovery in 2013, the phenomenon of adversarial examples has attracted a growing amount of attention from the machine learning community. A deeper understanding of the problem could lead to a better comprehension of how information is processed and encoded in neural networks and, more generally, could help to solve the issue of interpretability in machine learning. Our idea to increase adversarial resilience starts with the observation that artificial neurons can be divided into two broad categories: AND-like neurons and OR-like neurons. Intuitively, the former are characterised by a relatively low number of combinations of input values which trigger neuron activation, while for the latter the opposite is true. Our hypothesis is that the presence in a network of a sufficiently high number of OR-like neurons could lead to classification \"brittleness\" and increase the network's susceptibility to adversarial attacks.
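One operational way to make the AND-like/OR-like dichotomy concrete is an activation-rate measure over random binary inputs; the following is an illustrative sketch, not the paper's exact definition:

```python
import numpy as np

def activation_rate(w, b, n_samples=100000, rng=np.random.default_rng(0)):
    """Fraction of random binary input patterns that activate a ReLU neuron.
    A low rate is AND-like (few input combinations fire it), a high rate OR-like."""
    X = rng.integers(0, 2, size=(n_samples, w.size))
    return float(np.mean(X @ w + b > 0))

w = np.ones(8)
print(activation_rate(w, b=-7.5))  # ~1/256: AND-like (all inputs must be on)
print(activation_rate(w, b=-0.5))  # ~255/256: OR-like (any single input suffices)
```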
After constructing an operational definition of a neuron's AND-like behaviour, we proceed to introduce several measures to increase the proportion of AND-like neurons in the network: L1 norm weight normalisation; application of an input filter; comparison between the neuron output's distribution obtained when the network is fed with the actual data set and the distribution obtained when the network is fed with a randomised version of the former called \"scrambled data set\". Tests performed on the MNIST data set hint that the proposed measures could represent an interesting direction to explore.", "sentences": ["And/or trade-off in artificial neurons: impact on adversarial robustness.", "Since its discovery in 2013, the phenomenon of adversarial examples has attracted a growing amount of attention from the machine learning community.", "A deeper understanding of the problem could lead to a better comprehension of how information is processed and encoded in neural networks and, more generally, could help to solve the issue of interpretability in machine learning.", "Our idea to increase adversarial resilience starts with the observation that artificial neurons can be divided into two broad categories: AND-like neurons and OR-like neurons.", "Intuitively, the former are characterised by a relatively low number of combinations of input values which trigger neuron activation, while for the latter the opposite is true.", "Our hypothesis is that the presence in a network of a sufficiently high number of OR-like neurons could lead to classification \"brittleness\" and increase the network's susceptibility to adversarial attacks.", "After constructing an operational definition of a neuron's AND-like behaviour, we proceed to introduce several measures to increase the proportion of AND-like neurons in the network: L1 norm weight normalisation; application of an input filter; comparison between the neuron output's distribution obtained when the network is fed with the actual data set and the distribution obtained when the network is fed with a randomised version of the former called \"scrambled data set\".", "Tests performed on the MNIST data set hint that the proposed measures could represent an interesting direction to explore."]} {"id": "http://arxiv.org/abs/2102.09631", "title": "Peering Beyond the Gradient Veil with Distributed Auto Differentiation.", "authors": "Bradley T. Baker, Aashis Khanal, Vince D. Calhoun, Barak Pearlmutter, Sergey M. Plis", "abstract": "Although distributed machine learning has opened up many new and exciting research frontiers, fragmentation of models and data across different machines, nodes, and sites still results in considerable communication overhead, impeding reliable training in real-world contexts. The focus on gradients as the primary shared statistic during training has spawned a number of intuitive algorithms for distributed deep learning; however, gradient-centric training of large deep neural networks (DNNs) tends to be communication-heavy, often requiring additional adaptations such as sparsity constraints, compression, quantization, and more, to curtail bandwidth. We introduce an innovative, communication-friendly approach for training distributed DNNs, which capitalizes on the outer-product structure of the gradient as revealed by the mechanics of auto-differentiation. The exposed structure of the gradient evokes a new class of distributed learning algorithm, which is naturally more communication-efficient than full gradient sharing.
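The outer-product structure being exploited can be seen in a single linear layer, where the weight gradient factors into activations and backpropagated errors; sharing the factors instead of the full gradient is the communication saving (a sketch of the structure, not the full dAD algorithm):

```python
import numpy as np

# For a linear layer y = x @ W, backprop gives dL/dW = x.T @ delta.
B, n, m = 32, 1024, 1024
x = np.random.randn(B, n)       # layer inputs for the mini-batch
delta = np.random.randn(B, m)   # gradients w.r.t. the layer outputs

full_grad = x.T @ delta         # n*m = 1,048,576 numbers to communicate
factors = (x, delta)            # B*(n+m) = 65,536 numbers instead
assert np.allclose(full_grad, factors[0].T @ factors[1])
```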
Our approach, called distributed auto-differentiation (dAD), builds off a marriage of rank-based compression and the innate structure of the gradient as an outer-product. We demonstrate that dAD trains more efficiently than other state of the art distributed methods on modern architectures, such as transformers, when applied to large-scale text and imaging datasets. The future of distributed learning, we determine, need not be dominated by gradient-centric algorithms.", "sentences": ["Peering Beyond the Gradient Veil with Distributed Auto Differentiation.", "Although distributed machine learning has opened up many new and exciting research frontiers, fragmentation of models and data across different machines, nodes, and sites still results in considerable communication overhead, impeding reliable training in real-world contexts.", "The focus on gradients as the primary shared statistic during training has spawned a number of intuitive algorithms for distributed deep learning; however, gradient-centric training of large deep neural networks (DNNs) tends to be communication-heavy, often requiring additional adaptations such as sparsity constraints, compression, quantization, and more, to curtail bandwidth.", "We introduce an innovative, communication-friendly approach for training distributed DNNs, which capitalizes on the outer-product structure of the gradient as revealed by the mechanics of auto-differentiation.", "The exposed structure of the gradient evokes a new class of distributed learning algorithm, which is naturally more communication-efficient than full gradient sharing.", "Our approach, called distributed auto-differentiation (dAD), builds off a marriage of rank-based compression and the innate structure of the gradient as an outer-product.", "We demonstrate that dAD trains more efficiently than other state of the art distributed methods on modern architectures, such as transformers, when applied to large-scale text and imaging datasets.", "The future of distributed learning, we determine, need not be dominated by gradient-centric algorithms."]} {"id": "http://arxiv.org/abs/2103.01710", "title": "Autobahn: Automorphism-based Graph Neural Nets.", "authors": "Erik Henning Thiede, Wenda Zhou, Risi Kondor", "abstract": "We introduce Automorphism-based graph neural networks (Autobahn), a new family of graph neural networks. In an Autobahn, we decompose the graph into a collection of subgraphs and apply local convolutions that are equivariant to each subgraph's automorphism group. Specific choices of local neighborhoods and subgraphs recover existing architectures such as message passing neural networks. Our formalism also encompasses novel architectures: as an example, we introduce a graph neural network that decomposes the graph into paths and cycles. The resulting convolutions reflect the natural way that parts of the graph can transform, preserving the intuitive meaning of convolution without sacrificing global permutation equivariance. 
We validate our approach by applying Autobahn to molecular graphs, where it achieves results competitive with state-of-the-art message passing algorithms.", "sentences": ["Autobahn: Automorphism-based Graph Neural Nets.", "We introduce Automorphism-based graph neural networks (Autobahn), a new family of graph neural networks.", "In an Autobahn, we decompose the graph into a collection of subgraphs and apply local convolutions that are equivariant to each subgraph's automorphism group.", "Specific choices of local neighborhoods and subgraphs recover existing architectures such as message passing neural networks.", "Our formalism also encompasses novel architectures: as an example, we introduce a graph neural network that decomposes the graph into paths and cycles.", "The resulting convolutions reflect the natural way that parts of the graph can transform, preserving the intuitive meaning of convolution without sacrificing global permutation equivariance.", "We validate our approach by applying Autobahn to molecular graphs, where it achieves results competitive with state-of-the-art message passing algorithms."]} {"id": "http://arxiv.org/abs/2103.07853", "title": "Membership Inference Attacks on Machine Learning: A Survey.", "authors": "Hongsheng Hu, Zoran Salcic, Lichao Sun, Gillian Dobbie, Philip S. Yu, Xuyun Zhang", "abstract": "Machine learning (ML) models have been widely applied to various applications, including image classification, text generation, audio recognition, and graph data analysis. However, recent studies have shown that ML models are vulnerable to membership inference attacks (MIAs), which aim to infer whether a data record was used to train a target model or not. MIAs on ML models can directly lead to a privacy breach. For example, by identifying that a clinical record has been used to train a model associated with a certain disease, an attacker can infer that the owner of the clinical record has the disease with high probability. In recent years, MIAs have been shown to be effective on various ML models, e.g., classification models and generative models. Meanwhile, many defense methods have been proposed to mitigate MIAs. Although MIAs on ML models form a newly emerging and rapidly growing research area, there has been no systematic survey on this topic yet. In this paper, we conduct the first comprehensive survey on membership inference attacks and defenses. We provide the taxonomies for both attacks and defenses, based on their characterizations, and discuss their pros and cons. Based on the limitations and gaps identified in this survey, we point out several promising future research directions to inspire the researchers who wish to follow this area. This survey not only serves as a reference for the research community but also provides a clear description for researchers outside this research domain. To further help the researchers, we have created an online resource repository, which we will keep updated with future relevant work.
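A canonical baseline from the attack family surveyed above is confidence thresholding, exploiting the fact that models tend to be more confident on their training data. A minimal sketch on hypothetical confidence scores:

```python
import numpy as np

def confidence_attack(conf_members, conf_nonmembers, tau=0.9):
    """Predict 'member' when the model's top predicted probability exceeds tau,
    and report the attack's accuracy on a balanced evaluation set."""
    preds = np.concatenate([conf_members > tau, conf_nonmembers > tau])
    labels = np.concatenate([np.ones_like(conf_members, dtype=bool),
                             np.zeros_like(conf_nonmembers, dtype=bool)])
    return float((preds == labels).mean())

# Hypothetical confidences: members skew closer to 1.0 than non-members.
members = np.random.beta(8, 1, 1000)
nonmembers = np.random.beta(3, 2, 1000)
print(confidence_attack(members, nonmembers))  # well above the 0.5 chance level
```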
Interested readers can find the repository at https://github.com/HongshengHu/membership-inference-machine-learning-literature.", "sentences": ["Membership Inference Attacks on Machine Learning: A Survey.", "Machine learning (ML) models have been widely applied to various applications, including image classification, text generation, audio recognition, and graph data analysis.", "However, recent studies have shown that ML models are vulnerable to membership inference attacks (MIAs), which aim to infer whether a data record was used to train a target model or not.", "MIAs on ML models can directly lead to a privacy breach.", "For example, by identifying that a clinical record has been used to train a model associated with a certain disease, an attacker can infer that the owner of the clinical record has the disease with high probability.", "In recent years, MIAs have been shown to be effective on various ML models, e.g., classification models and generative models.", "Meanwhile, many defense methods have been proposed to mitigate MIAs.", "Although MIAs on ML models form a newly emerging and rapidly growing research area, there has been no systematic survey on this topic yet.", "In this paper, we conduct the first comprehensive survey on membership inference attacks and defenses.", "We provide the taxonomies for both attacks and defenses, based on their characterizations, and discuss their pros and cons.", "Based on the limitations and gaps identified in this survey, we point out several promising future research directions to inspire the researchers who wish to follow this area.", "This survey not only serves as a reference for the research community but also provides a clear description for researchers outside this research domain.", "To further help the researchers, we have created an online resource repository, which we will keep updated with future relevant work.", "Interested readers can find the repository at https://github.com/HongshengHu/membership-inference-machine-learning-literature."]} {"id": "http://arxiv.org/abs/2103.13466", "title": "Asymptotic Freeness of Layerwise Jacobians Caused by Invariance of Multilayer Perceptron: The Haar Orthogonal Case.", "authors": "Benoit Collins, Tomohiro Hayase", "abstract": "Free Probability Theory (FPT) provides rich knowledge for handling mathematical difficulties caused by random matrices that appear in research related to deep neural networks (DNNs), such as the dynamical isometry, Fisher information matrix, and training dynamics. FPT suits these studies because the DNN's parameter-Jacobian and input-Jacobian are polynomials of layerwise Jacobians. However, the critical assumption of asymptotic freeness of the layerwise Jacobian has not been proven completely so far. The asymptotic freeness assumption plays a fundamental role when propagating spectral distributions through the layers. Haar distributed orthogonal matrices are essential for achieving dynamical isometry. In this work, we prove asymptotic freeness of layerwise Jacobians of multilayer perceptron (MLP) in this case. A key to the proof is an invariance of the MLP. Considering the orthogonal matrices that fix the hidden units in each layer, we replace each layer's parameter matrix with itself multiplied by the orthogonal matrix, and then the MLP does not change. Furthermore, if the original weights are Haar orthogonal, the Jacobian is also unchanged by this replacement.
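To make "Haar orthogonal weights" concrete, a standard way to sample from the Haar measure is QR decomposition of a Gaussian matrix with a sign correction (a textbook construction, shown purely for illustration):

```python
import numpy as np

def haar_orthogonal(n, rng=np.random.default_rng(0)):
    """Sample an n x n orthogonal matrix from the Haar measure: QR-decompose
    a Gaussian matrix and fix the column signs so the law is exactly Haar."""
    Z = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))  # scale column j by sign(R[j, j])

Q = haar_orthogonal(64)
print(np.allclose(Q.T @ Q, np.eye(64)))  # True: orthogonal
```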
Lastly, we can replace each weight with a Haar orthogonal random matrix independent of the Jacobian of the activation function using this key fact.", "sentences": ["Asymptotic Freeness of Layerwise Jacobians Caused by Invariance of Multilayer Perceptron: The Haar Orthogonal Case.", "Free Probability Theory (FPT) provides rich knowledge for handling mathematical difficulties caused by random matrices that appear in research related to deep neural networks (DNNs), such as the dynamical isometry, Fisher information matrix, and training dynamics.", "FPT suits these studies because the DNN's parameter-Jacobian and input-Jacobian are polynomials of layerwise Jacobians.", "However, the critical assumption of asymptotic freeness of the layerwise Jacobian has not been proven completely so far.", "The asymptotic freeness assumption plays a fundamental role when propagating spectral distributions through the layers.", "Haar distributed orthogonal matrices are essential for achieving dynamical isometry.", "In this work, we prove asymptotic freeness of layerwise Jacobians of multilayer perceptron (MLP) in this case.", "A key to the proof is an invariance of the MLP.", "Considering the orthogonal matrices that fix the hidden units in each layer, we replace each layer's parameter matrix with itself multiplied by the orthogonal matrix, and then the MLP does not change.", "Furthermore, if the original weights are Haar orthogonal, the Jacobian is also unchanged by this replacement.", "Lastly, we can replace each weight with a Haar orthogonal random matrix independent of the Jacobian of the activation function using this key fact."]} {"id": "http://arxiv.org/abs/2103.16440", "title": "Neural Transformation Learning for Deep Anomaly Detection Beyond Images.", "authors": "Chen Qiu, Timo Pfrommer, Marius Kloft, Stephan Mandt, Maja Rudolph", "abstract": "Data transformations (e.g. rotations, reflections, and cropping) play an important role in self-supervised learning. Typically, images are transformed into different views, and neural networks trained on tasks involving these views produce useful feature representations for downstream tasks, including anomaly detection. However, for anomaly detection beyond image data, it is often unclear which transformations to use. Here we present a simple end-to-end procedure for anomaly detection with learnable transformations. The key idea is to embed the transformed data into a semantic space such that the transformed data still resemble their untransformed form, while different transformations are easily distinguishable. Extensive experiments on time series demonstrate that our proposed method outperforms existing approaches in the one-vs.-rest setting and is competitive in the more challenging n-vs.-rest anomaly detection task.
On tabular datasets from the medical and cyber-security domains, our method learns domain-specific transformations and detects anomalies more accurately than previous work.", "sentences": ["Neural Transformation Learning for Deep Anomaly Detection Beyond Images.", "Data transformations (e.g. rotations, reflections, and cropping) play an important role in self-supervised learning.", "Typically, images are transformed into different views, and neural networks trained on tasks involving these views produce useful feature representations for downstream tasks, including anomaly detection.", "However, for anomaly detection beyond image data, it is often unclear which transformations to use.", "Here we present a simple end-to-end procedure for anomaly detection with learnable transformations.", "The key idea is to embed the transformed data into a semantic space such that the transformed data still resemble their untransformed form, while different transformations are easily distinguishable.", "Extensive experiments on time series demonstrate that our proposed method outperforms existing approaches in the one-vs.-rest setting and is competitive in the more challenging n-vs.-rest anomaly detection task.", "On tabular datasets from the medical and cyber-security domains, our method learns domain-specific transformations and detects anomalies more accurately than previous work."]} {"id": "http://arxiv.org/abs/2104.09237", "title": "Inverse Bayesian Optimization: Learning Human Acquisition Functions in an Exploration vs Exploitation Search Task.", "authors": "Nathan Sandholtz, Yohsuke Miyamoto, Luke Bornn, Maurice Smith", "abstract": "This paper introduces a probabilistic framework to estimate parameters of an acquisition function given observed human behavior that can be modeled as a collection of sample paths from a Bayesian optimization procedure. The methodology involves defining a likelihood on observed human behavior from an optimization task, where the likelihood is parameterized by a Bayesian optimization subroutine governed by an unknown acquisition function. This structure enables us to make inference on a subject's acquisition function while allowing their behavior to deviate around the solution to the Bayesian optimization subroutine. To test our methods, we designed a sequential optimization task which forced subjects to balance exploration and exploitation in search of an invisible target location. Applying our proposed methods to the resulting data, we find that many subjects tend to exhibit exploration preferences beyond what standard acquisition functions can capture.
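A standard acquisition function of the kind the framework above fits to human behavior is the upper confidence bound, whose beta parameter encodes the exploration/exploitation trade-off (an illustrative sketch with invented posterior values):

```python
import numpy as np

def ucb_choice(mu, sigma, beta=2.0):
    """Upper confidence bound acquisition: pick the candidate maximizing
    mean + beta * std; larger beta rewards uncertain (unexplored) candidates."""
    return int(np.argmax(mu + beta * sigma))

mu = np.array([0.2, 0.5, 0.4])      # posterior means over candidate locations
sigma = np.array([0.5, 0.05, 0.3])  # posterior uncertainties
print(ucb_choice(mu, sigma, beta=0.1))  # 1: exploit the best mean
print(ucb_choice(mu, sigma, beta=3.0))  # 0: explore the most uncertain point
```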
Guided by the model discrepancies, we augment the candidate acquisition functions to yield a superior fit to the human behavior in this task.", "sentences": ["Inverse Bayesian Optimization: Learning Human Acquisition Functions in an Exploration vs Exploitation Search Task.", "This paper introduces a probabilistic framework to estimate parameters of an acquisition function given observed human behavior that can be modeled as a collection of sample paths from a Bayesian optimization procedure.", "The methodology involves defining a likelihood on observed human behavior from an optimization task, where the likelihood is parameterized by a Bayesian optimization subroutine governed by an unknown acquisition function.", "This structure enables us to make inference on a subject's acquisition function while allowing their behavior to deviate around the solution to the Bayesian optimization subroutine.", "To test our methods, we designed a sequential optimization task which forced subjects to balance exploration and exploitation in search of an invisible target location.", "Applying our proposed methods to the resulting data, we find that many subjects tend to exhibit exploration preferences beyond what standard acquisition functions can capture.", "Guided by the model discrepancies, we augment the candidate acquisition functions to yield a superior fit to the human behavior in this task."]} {"id": "http://arxiv.org/abs/2104.12199", "title": "Sampling Permutations for Shapley Value Estimation.", "authors": "Rory Mitchell, Joshua Cooper, Eibe Frank, Geoffrey Holmes", "abstract": "Game-theoretic attribution techniques based on Shapley values are used to interpret black-box machine learning models, but their exact calculation is generally NP-hard, requiring approximation methods for non-trivial models. As the computation of Shapley values can be expressed as a summation over a set of permutations, a common approach is to sample a subset of these permutations for approximation. Unfortunately, standard Monte Carlo sampling methods can exhibit slow convergence, and more sophisticated quasi-Monte Carlo methods have not yet been applied to the space of permutations. To address this, we investigate new approaches based on two classes of approximation methods and compare them empirically. First, we demonstrate quadrature techniques in an RKHS containing functions of permutations, using the Mallows kernel in combination with kernel herding and sequential Bayesian quadrature. The RKHS perspective also leads to quasi-Monte Carlo type error bounds, with a tractable discrepancy measure defined on permutations. Second, we exploit connections between the hypersphere $\\mathbb{S}^{d-2}$ and permutations to create practical algorithms for generating permutation samples with good properties.
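The baseline these sampling schemes improve on is plain Monte Carlo over uniformly random permutations; a minimal sketch for a toy value function:

```python
import numpy as np

def shapley_mc(value, d, n_perms=2000, rng=np.random.default_rng(0)):
    """Monte Carlo Shapley estimation: average each player's marginal
    contribution over uniformly sampled permutations.  The paper above
    replaces this uniform sampling with kernel herding / quasi-Monte Carlo
    permutation samples; this is the baseline being improved on."""
    phi = np.zeros(d)
    for _ in range(n_perms):
        perm = rng.permutation(d)
        coalition, prev = set(), value(set())
        for i in perm:
            coalition.add(i)
            cur = value(coalition)
            phi[i] += cur - prev
            prev = cur
    return phi / n_perms

v = lambda S: float(len(S) ** 2)  # toy value function with symmetric players
print(shapley_mc(v, d=3))         # each estimate is close to 9 / 3 = 3
```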
Experiments show the above techniques provide significant improvements for Shapley value estimates over existing methods, converging to a smaller RMSE in the same number of model evaluations.", "sentences": ["Sampling Permutations for Shapley Value Estimation.", "Game-theoretic attribution techniques based on Shapley values are used to interpret black-box machine learning models, but their exact calculation is generally NP-hard, requiring approximation methods for non-trivial models.", "As the computation of Shapley values can be expressed as a summation over a set of permutations, a common approach is to sample a subset of these permutations for approximation.", "Unfortunately, standard Monte Carlo sampling methods can exhibit slow convergence, and more sophisticated quasi-Monte Carlo methods have not yet been applied to the space of permutations.", "To address this, we investigate new approaches based on two classes of approximation methods and compare them empirically.", "First, we demonstrate quadrature techniques in an RKHS containing functions of permutations, using the Mallows kernel in combination with kernel herding and sequential Bayesian quadrature.", "The RKHS perspective also leads to quasi-Monte Carlo type error bounds, with a tractable discrepancy measure defined on permutations.", "Second, we exploit connections between the hypersphere $\\mathbb{S}^{d-2}$ and permutations to create practical algorithms for generating permutation samples with good properties.", "Experiments show the above techniques provide significant improvements for Shapley value estimates over existing methods, converging to a smaller RMSE in the same number of model evaluations."]} {"id": "http://arxiv.org/abs/2105.02103", "title": "Prototype Memory for Large-scale Face Representation Learning.", "authors": "Evgeny Smirnov, Nikita Garaev, Vasiliy Galyuk, Evgeny Lukyanets", "abstract": "Face representation learning using datasets with a massive number of identities requires appropriate training methods. The softmax-based approach, currently the state-of-the-art in face recognition, in its usual \"full softmax\" form is not suitable for datasets with millions of persons. Several methods, based on the \"sampled softmax\" approach, were proposed to remove this limitation. These methods, however, have a set of disadvantages. One of them is a problem of \"prototype obsolescence\": classifier weights (prototypes) of the rarely sampled classes receive too scarce gradients and become outdated and detached from the current encoder state, resulting in incorrect training signals. This problem is especially serious in ultra-large-scale datasets. In this paper, we propose a novel face representation learning model called Prototype Memory, which alleviates this problem and allows training on a dataset of any size. Prototype Memory consists of a limited-size memory module for storing recent class prototypes and employs a set of algorithms to update it in an appropriate way. New class prototypes are generated on the fly using exemplar embeddings in the current mini-batch. These prototypes are enqueued to the memory and used in the role of classifier weights for softmax classification-based training. To prevent obsolescence and keep the memory in close connection with the encoder, prototypes are regularly refreshed, and the oldest ones are dequeued and disposed of. Prototype Memory is computationally efficient and independent of dataset size.
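A much-simplified sketch of the prototype-queue idea, with FIFO eviction and prototypes reused as classifier weights (the paper's refresh and update algorithms are richer than this):

```python
import numpy as np
from collections import OrderedDict

class PrototypeMemory:
    """Per class, keep the most recent prototype built from in-batch exemplar
    embeddings, evicting the stalest classes once capacity is exceeded."""
    def __init__(self, capacity=1000):
        self.capacity, self.protos = capacity, OrderedDict()

    def update(self, embeddings, labels):
        for c in np.unique(labels):
            proto = embeddings[labels == c].mean(axis=0)
            self.protos[int(c)] = proto / np.linalg.norm(proto)
            self.protos.move_to_end(int(c))      # mark as freshly refreshed
        while len(self.protos) > self.capacity:  # dequeue the oldest classes
            self.protos.popitem(last=False)

    def logits(self, embeddings):
        W = np.stack(list(self.protos.values())) # prototypes as classifier weights
        return embeddings @ W.T

mem = PrototypeMemory(capacity=2)
emb = np.random.randn(6, 8); lab = np.array([0, 0, 1, 1, 2, 2])
mem.update(emb, lab)
print(sorted(mem.protos))  # [1, 2]: class 0 was evicted as the oldest
```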
It can be used with various loss functions, hard example mining algorithms and encoder architectures. We prove the effectiveness of the proposed model by extensive experiments on popular face recognition benchmarks.", "sentences": ["Prototype Memory for Large-scale Face Representation Learning.", "Face representation learning using datasets with a massive number of identities requires appropriate training methods.", "The softmax-based approach, currently the state-of-the-art in face recognition, in its usual \"full softmax\" form is not suitable for datasets with millions of persons.", "Several methods, based on the \"sampled softmax\" approach, were proposed to remove this limitation.", "These methods, however, have a set of disadvantages.", "One of them is a problem of \"prototype obsolescence\": classifier weights (prototypes) of the rarely sampled classes receive too scarce gradients and become outdated and detached from the current encoder state, resulting in incorrect training signals.", "This problem is especially serious in ultra-large-scale datasets.", "In this paper, we propose a novel face representation learning model called Prototype Memory, which alleviates this problem and allows training on a dataset of any size.", "Prototype Memory consists of a limited-size memory module for storing recent class prototypes and employs a set of algorithms to update it in an appropriate way.", "New class prototypes are generated on the fly using exemplar embeddings in the current mini-batch.", "These prototypes are enqueued to the memory and used in the role of classifier weights for softmax classification-based training.", "To prevent obsolescence and keep the memory in close connection with the encoder, prototypes are regularly refreshed, and the oldest ones are dequeued and disposed of.", "Prototype Memory is computationally efficient and independent of dataset size.", "It can be used with various loss functions, hard example mining algorithms and encoder architectures.", "We prove the effectiveness of the proposed model by extensive experiments on popular face recognition benchmarks."]} {"id": "http://arxiv.org/abs/2105.02522", "title": "Neural graphical modelling in continuous-time: consistency guarantees and algorithms.", "authors": "Alexis Bellot, Kim Branson, Mihaela van der Schaar", "abstract": "The discovery of structure from time series data is a key problem in fields of study working with complex systems. Most identifiability results and learning algorithms assume the underlying dynamics to be discrete in time. Comparatively few, in contrast, explicitly define dependencies in infinitesimal intervals of time, independently of the scale of observation and of the regularity of sampling. In this paper, we consider score-based structure learning for the study of dynamical systems. We prove that for vector fields parameterized in a large class of neural networks, least squares optimization with adaptive regularization schemes consistently recovers directed graphs of local independencies in systems of stochastic differential equations.
Using this insight, we propose a score-based learning algorithm based on penalized Neural Ordinary Differential Equations (modelling the mean process) that we show to be applicable to the general setting of irregularly-sampled multivariate time series and to outperform the state of the art across a range of dynamical systems.", "sentences": ["Neural graphical modelling in continuous-time: consistency guarantees and algorithms.", "The discovery of structure from time series data is a key problem in fields of study working with complex systems.", "Most identifiability results and learning algorithms assume the underlying dynamics to be discrete in time.", "Comparatively few, in contrast, explicitly define dependencies in infinitesimal intervals of time, independently of the scale of observation and of the regularity of sampling.", "In this paper, we consider score-based structure learning for the study of dynamical systems.", "We prove that for vector fields parameterized in a large class of neural networks, least squares optimization with adaptive regularization schemes consistently recovers directed graphs of local independencies in systems of stochastic differential equations.", "Using this insight, we propose a score-based learning algorithm based on penalized Neural Ordinary Differential Equations (modelling the mean process) that we show to be applicable to the general setting of irregularly-sampled multivariate time series and to outperform the state of the art across a range of dynamical systems."]} {"id": "http://arxiv.org/abs/2105.07066", "title": "Node Selection Toward Faster Convergence for Federated Learning on Non-IID Data.", "authors": "Hongda Wu, Ping Wang", "abstract": "Federated Learning (FL) is a distributed learning paradigm that enables a large number of resource-limited nodes to collaboratively train a model without data sharing. The non-independent-and-identically-distributed (non-i.i.d.) data samples invoke discrepancies between the global and local objectives, making the FL model slow to converge. In this paper, we propose an Optimal Aggregation algorithm for better aggregation, which finds the optimal subset of local updates of participating nodes in each global round, by identifying and excluding the adverse local updates via checking the relationship between the local gradient and the global gradient. Then, we propose a Probabilistic Node Selection framework (FedPNS) to dynamically change the probability for each node to be selected based on the output of Optimal Aggregation. FedPNS can preferentially select nodes that propel faster model convergence. The unbiasedness of the proposed FedPNS design is illustrated and the convergence rate improvement of FedPNS over the commonly adopted Federated Averaging (FedAvg) algorithm is analyzed theoretically.
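For reference, a FedAvg-style aggregation step with a simple adverse-update check in the spirit of the abstract above (a sketch with invented toy updates; the paper's Optimal Aggregation test is more involved):

```python
import numpy as np

def aggregate(global_w, updates, sizes):
    """Drop local updates whose inner product with the size-weighted mean
    update is negative, then average the rest weighted by local dataset size."""
    w = np.asarray(sizes, float) / np.sum(sizes)
    mean_update = sum(wi * u for wi, u in zip(w, updates))
    keep = [i for i, u in enumerate(updates) if float(u @ mean_update) > 0.0]
    if not keep:                        # degenerate case: fall back to FedAvg
        keep = list(range(len(updates)))
    wk = np.array([sizes[i] for i in keep], float)
    wk /= wk.sum()
    return global_w + sum(wi * updates[i] for wi, i in zip(wk, keep))

updates = [np.array([1., 1., 0., 0.]),
           np.array([1., 0., 1., 0.]),
           np.array([-1., -1., -1., 0.])]   # an adverse client update
print(aggregate(np.zeros(4), updates, sizes=[100, 100, 100]))
```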
Experimental results demonstrate the effectiveness of FedPNS in accelerating the FL convergence rate, as compared to FedAvg with random node selection.", "sentences": ["Node Selection Toward Faster Convergence for Federated Learning on Non-IID Data.", "Federated Learning (FL) is a distributed learning paradigm that enables a large number of resource-limited nodes to collaboratively train a model without data sharing.", "The non-independent-and-identically-distributed (non-i.i.d.) data samples invoke discrepancies between the global and local objectives, making the FL model slow to converge.", "In this paper, we propose an Optimal Aggregation algorithm for better aggregation, which finds the optimal subset of local updates of participating nodes in each global round, by identifying and excluding the adverse local updates via checking the relationship between the local gradient and the global gradient.", "Then, we propose a Probabilistic Node Selection framework (FedPNS) to dynamically change the probability for each node to be selected based on the output of Optimal Aggregation.", "FedPNS can preferentially select nodes that propel faster model convergence.", "The unbiasedness of the proposed FedPNS design is illustrated and the convergence rate improvement of FedPNS over the commonly adopted Federated Averaging (FedAvg) algorithm is analyzed theoretically.", "Experimental results demonstrate the effectiveness of FedPNS in accelerating the FL convergence rate, as compared to FedAvg with random node selection."]} {"id": "http://arxiv.org/abs/2105.14172", "title": "A Stochastic Alternating Balance $k$-Means Algorithm for Fair Clustering.", "authors": "Suyun Liu, Luis Nunes Vicente", "abstract": "In the application of data clustering to human-centric decision-making systems, such as loan applications and advertisement recommendations, the clustering outcome might discriminate against people across different demographic groups, leading to unfairness. A natural conflict occurs between the cost of clustering (in terms of distance to cluster centers) and the balanced representation of all demographic groups across the clusters, leading to a bi-objective optimization problem that is nonconvex and nonsmooth. To determine the complete trade-off between these two competing goals, we design a novel stochastic alternating balance fair $k$-means (SAfairKM) algorithm, which consists of alternating classical mini-batch $k$-means updates and group swap updates. The number of $k$-means updates and the number of swap updates essentially parameterize the weight put on optimizing each objective function.
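A compact sketch of the two alternating updates, a centroid step for clustering cost and a crude swap step for demographic balance (heavily simplified relative to SAfairKM; the data here are invented):

```python
import numpy as np

def assign_clusters(X, C):
    """Assign each point to its nearest centroid."""
    return np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)

def kmeans_update(X, C, assign):
    """Cost step: move each non-empty centroid to the mean of its points."""
    C = C.copy()
    for j in range(C.shape[0]):
        pts = X[assign == j]
        if len(pts):
            C[j] = pts.mean(axis=0)
    return C

def balance_swap(assign, groups, g=1):
    """Balance step (crude): move one point of group g from the cluster where
    g is most over-represented to the cluster where it is most under-represented."""
    k = int(assign.max()) + 1
    counts = np.array([np.sum((assign == j) & (groups == g)) for j in range(k)])
    src, dst = int(counts.argmax()), int(counts.argmin())
    idx = np.where((assign == src) & (groups == g))[0]
    if idx.size and src != dst:
        assign = assign.copy()
        assign[idx[0]] = dst
    return assign

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)); groups = rng.integers(0, 2, 200)
C = X[rng.choice(200, 3, replace=False)]
for _ in range(10):                 # alternate the two updates
    a = balance_swap(assign_clusters(X, C), groups)
    C = kmeans_update(X, C, a)
```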
Our numerical experiments show that the proposed SAfairKM algorithm is robust and computationally efficient in constructing well-spread and high-quality Pareto fronts both on synthetic and real datasets.", "sentences": ["A Stochastic Alternating Balance $k$-Means Algorithm for Fair Clustering.", "In the application of data clustering to human-centric decision-making systems, such as loan applications and advertisement recommendations, the clustering outcome might discriminate against people across different demographic groups, leading to unfairness.", "A natural conflict occurs between the cost of clustering (in terms of distance to cluster centers) and the balanced representation of all demographic groups across the clusters, leading to a bi-objective optimization problem that is nonconvex and nonsmooth.", "To determine the complete trade-off between these two competing goals, we design a novel stochastic alternating balance fair $k$-means (SAfairKM) algorithm, which consists of alternating classical mini-batch $k$-means updates and group swap updates.", "The number of $k$-means updates and the number of swap updates essentially parameterize the weight put on optimizing each objective function.", "Our numerical experiments show that the proposed SAfairKM algorithm is robust and computationally efficient in constructing well-spread and high-quality Pareto fronts both on synthetic and real datasets."]} {"id": "http://arxiv.org/abs/2105.14933", "title": "The use of Generative Adversarial Networks to characterise new physics in multi-lepton final states at the LHC.", "authors": "Thabang Lebese, Xifeng Ruan", "abstract": "Semi-supervision in Machine Learning can be used in searches for new physics where the signal plus background regions are not labelled. This strongly reduces model dependency in the search for signals Beyond the Standard Model. This approach has the drawback that over-fitting can give rise to fake signals. Tossing toy Monte Carlo (MC) events can be used to estimate the corresponding trials factor through a frequentist inference. However, MC events that are based on full detector simulations are resource intensive. Generative Adversarial Networks (GANs) can be used to mimic MC generators. GANs are powerful generative models, but often suffer from training instability. We therefore present a review of GANs. We advocate the use of Wasserstein GAN (WGAN) with weight clipping and WGAN with gradient penalty (WGAN-GP) where the norm of the gradient of the critic is penalized with respect to its input. Following the emergence of multi-lepton anomalies, we apply GANs for the generation of di-lepton final states in association with $b$-quarks at the LHC.
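The WGAN-GP critic penalty referenced above has a standard form, penalizing deviations of the critic's gradient norm from 1 along random interpolates between real and generated samples; a PyTorch sketch for flat feature vectors (the toy critic is an assumption for illustration):

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """Standard WGAN-GP term: lam * E[(||grad_xhat D(xhat)||_2 - 1)^2],
    with xhat a random interpolate between real and fake samples."""
    eps = torch.rand(real.size(0), 1)
    xhat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(xhat).sum(), xhat, create_graph=True)[0]
    return lam * ((grad.norm(2, dim=1) - 1.0) ** 2).mean()

critic = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.ReLU(),
                             torch.nn.Linear(16, 1))
print(gradient_penalty(critic, torch.randn(8, 4), torch.randn(8, 4)))
```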
A good agreement between the MC and the WGAN-GP generated events is found for the observables selected in the study.", "sentences": ["The use of Generative Adversarial Networks to characterise new physics in multi-lepton final states at the LHC.", "Semi-supervision in Machine Learning can be used in searches for new physics where the signal plus background regions are not labelled.", "This strongly reduces model dependency in the search for signals Beyond the Standard Model.", "This approach has the drawback that over-fitting can give rise to fake signals.", "Tossing toy Monte Carlo (MC) events can be used to estimate the corresponding trials factor through a frequentist inference.", "However, MC events that are based on full detector simulations are resource intensive.", "Generative Adversarial Networks (GANs) can be used to mimic MC generators.", "GANs are powerful generative models, but often suffer from training instability.", "We therefore present a review of GANs.", "We advocate the use of Wasserstein GAN (WGAN) with weight clipping and WGAN with gradient penalty (WGAN-GP) where the norm of the gradient of the critic is penalized with respect to its input.", "Following the emergence of multi-lepton anomalies, we apply GANs for the generation of di-lepton final states in association with $b$-quarks at the LHC.", "A good agreement between the MC and the WGAN-GP generated events is found for the observables selected in the study."]} {"id": "http://arxiv.org/abs/2106.03186", "title": "Reverse Engineering the Neural Tangent Kernel.", "authors": "James B. Simon, Sajant Anand, Michael R. DeWeese", "abstract": "The development of methods to guide the design of neural networks is an important open challenge for deep learning theory. As a paradigm for principled neural architecture design, we propose the translation of high-performing kernels, which are better-understood and amenable to first-principles design, into equivalent network architectures, which have superior efficiency, flexibility, and feature learning. To this end, we constructively prove that, with just an appropriate choice of activation function, any positive-semidefinite dot-product kernel can be realized as either the conjugate or neural tangent kernel of a fully-connected neural network with only one hidden layer. We verify our construction numerically and demonstrate its utility as a design tool for finite fully-connected networks in several experiments.", "sentences": ["Reverse Engineering the Neural Tangent Kernel.", "The development of methods to guide the design of neural networks is an important open challenge for deep learning theory.", "As a paradigm for principled neural architecture design, we propose the translation of high-performing kernels, which are better-understood and amenable to first-principles design, into equivalent network architectures, which have superior efficiency, flexibility, and feature learning.", "To this end, we constructively prove that, with just an appropriate choice of activation function, any positive-semidefinite dot-product kernel can be realized as either the conjugate or neural tangent kernel of a fully-connected neural network with only one hidden layer.", "We verify our construction numerically and demonstrate its utility as a design tool for finite fully-connected networks in several experiments."]} {"id": "http://arxiv.org/abs/2106.04149", "title": "To Smooth or Not? 
When Label Smoothing Meets Noisy Labels.", "authors": "Jiaheng Wei, Hangyu Liu, Tongliang Liu, Gang Niu, Yang Liu", "abstract": "Label smoothing (LS) is an arising learning paradigm that uses the positively weighted average of both the hard training labels and uniformly distributed soft labels. It was shown that LS serves as a regularizer for training data with hard labels and therefore improves the generalization of the model. Later it was reported LS even helps with improving robustness when learning with noisy labels. However, we observed that the advantage of LS vanishes when we operate in a high label noise regime. Intuitively speaking, this is due to the increased entropy of $\\mathbb{P}(\\text{noisy label}|X)$ when the noise rate is high, in which case, further applying LS tends to \"oversmooth\" the estimated posterior. We proceeded to discover that several learning-with-noisy-labels solutions in the literature instead relate more closely to negative/not label smoothing (NLS), which acts counter to LS and defines as using a negative weight to combine the hard and soft labels! We provide understandings for the properties of LS and NLS when learning with noisy labels. Among other established properties, we theoretically show NLS is considered more beneficial when the label noise rates are high. We provide extensive experimental results on multiple benchmarks to support our findings too.", "sentences": ["To Smooth or Not? When Label Smoothing Meets Noisy Labels.", "Label smoothing (LS) is an arising learning paradigm that uses the positively weighted average of both the hard training labels and uniformly distributed soft labels.", "It was shown that LS serves as a regularizer for training data with hard labels and therefore improves the generalization of the model.", "Later it was reported LS even helps with improving robustness when learning with noisy labels.", "However, we observed that the advantage of LS vanishes when we operate in a high label noise regime.", "Intuitively speaking, this is due to the increased entropy of $\\mathbb{P}(\\text{noisy label}|X)$ when the noise rate is high, in which case, further applying LS tends to \"oversmooth\" the estimated posterior.", "We proceeded to discover that several learning-with-noisy-labels solutions in the literature instead relate more closely to negative/not label smoothing (NLS), which acts counter to LS and defines as using a negative weight to combine the hard and soft labels!", "We provide understandings for the properties of LS and NLS when learning with noisy labels.", "Among other established properties, we theoretically show NLS is considered more beneficial when the label noise rates are high.", "We provide extensive experimental results on multiple benchmarks to support our findings too."]} {"id": "http://arxiv.org/abs/2106.05418", "title": "Probing transfer learning with a model of synthetic correlated datasets.", "authors": "Federica Gerace, Luca Saglietti, Stefano Sarao Mannelli, Andrew Saxe, Lenka Zdeborov\u00e1", "abstract": "Transfer learning can significantly improve the sample efficiency of neural networks, by exploiting the relatedness between a data-scarce target task and a data-abundant source task. Despite years of successful applications, transfer learning practice often relies on ad-hoc solutions, while theoretical understanding of these procedures is still limited. In the present work, we re-think a solvable model of synthetic data as a framework for modeling correlation between data-sets. 
This setup allows for an analytic characterization of the generalization performance obtained when transferring the learned feature map from the source to the target task. Focusing on the problem of training two-layer networks in a binary classification setting, we show that our model can capture a range of salient features of transfer learning with real data. Moreover, by exploiting parametric control over the correlation between the two data-sets, we systematically investigate under which conditions the transfer of features is beneficial for generalization.", "sentences": ["Probing transfer learning with a model of synthetic correlated datasets.", "Transfer learning can significantly improve the sample efficiency of neural networks, by exploiting the relatedness between a data-scarce target task and a data-abundant source task.", "Despite years of successful applications, transfer learning practice often relies on ad-hoc solutions, while theoretical understanding of these procedures is still limited.", "In the present work, we re-think a solvable model of synthetic data as a framework for modeling correlation between data-sets.", "This setup allows for an analytic characterization of the generalization performance obtained when transferring the learned feature map from the source to the target task.", "Focusing on the problem of training two-layer networks in a binary classification setting, we show that our model can capture a range of salient features of transfer learning with real data.", "Moreover, by exploiting parametric control over the correlation between the two data-sets, we systematically investigate under which conditions the transfer of features is beneficial for generalization."]} {"id": "http://arxiv.org/abs/2106.08161", "title": "Contrastive Mixture of Posteriors for Counterfactual Inference, Data Integration and Fairness.", "authors": "Adam Foster, \u00c1rpi Vez\u00e9r, Craig A Glastonbury, P\u00e1id\u00ed Creed, Sam Abujudeh, Aaron Sim", "abstract": "Learning meaningful representations of data that can address challenges such as batch effect correction and counterfactual inference is a central problem in many domains including computational biology. Adopting a Conditional VAE framework, we show that marginal independence between the representation and a condition variable plays a key role in both of these challenges. We propose the Contrastive Mixture of Posteriors (CoMP) method that uses a novel misalignment penalty defined in terms of mixtures of the variational posteriors to enforce this independence in latent space. We show that CoMP has attractive theoretical properties compared to previous approaches and we prove counterfactual identifiability of CoMP under additional assumptions. We demonstrate state of the art performance on a set of challenging tasks including aligning human tumour samples with cancer cell-lines, predicting transcriptome-level perturbation responses, and batch correction on single-cell RNA sequencing data. 
We also find parallels to fair representation learning and demonstrate that CoMP is competitive on a common task in the field.", "sentences": ["Contrastive Mixture of Posteriors for Counterfactual Inference, Data Integration and Fairness.", "Learning meaningful representations of data that can address challenges such as batch effect correction and counterfactual inference is a central problem in many domains including computational biology.", "Adopting a Conditional VAE framework, we show that marginal independence between the representation and a condition variable plays a key role in both of these challenges.", "We propose the Contrastive Mixture of Posteriors (CoMP) method that uses a novel misalignment penalty defined in terms of mixtures of the variational posteriors to enforce this independence in latent space.", "We show that CoMP has attractive theoretical properties compared to previous approaches and we prove counterfactual identifiability of CoMP under additional assumptions.", "We demonstrate state of the art performance on a set of challenging tasks including aligning human tumour samples with cancer cell-lines, predicting transcriptome-level perturbation responses, and batch correction on single-cell RNA sequencing data.", "We also find parallels to fair representation learning and demonstrate that CoMP is competitive on a common task in the field."]} {"id": "http://arxiv.org/abs/2106.08902", "title": "Adaptive Clustering and Personalization in Multi-Agent Stochastic Linear Bandits.", "authors": "Avishek Ghosh, Abishek Sankararaman, Kannan Ramchandran", "abstract": "We consider the problem of minimizing regret in an $N$ agent heterogeneous stochastic linear bandits framework, where the agents (users) are similar but not all identical. We model user heterogeneity using two ideas popularly used in practice: (i) a clustering framework where users are partitioned into groups with users in the same group being identical to each other, but different across groups, and (ii) a personalization framework where no two users are necessarily identical, but a user's parameters are close to that of the population average. In the clustered users' setup, we propose a novel algorithm, based on successive refinement of cluster identities and regret minimization. We show that, for any agent, the regret scales as $\\mathcal{O}(\\sqrt{T/N})$, if the agent is in a `well separated' cluster, or scales as $\\mathcal{O}(T^{\\frac{1}{2} + \\varepsilon}/(N)^{\\frac{1}{2} -\\varepsilon})$ if its cluster is not well separated, where $\\varepsilon$ is positive and arbitrarily close to $0$. Our algorithm is adaptive to the cluster separation, and is parameter free -- it does not need to know the number of clusters, separation and cluster size, yet the regret guarantee adapts to the inherent complexity. In the personalization framework, we introduce a natural algorithm where the personal bandit instances are initialized with the estimates of the global average model. We show that an agent $i$ whose parameter deviates from the population average by $\\epsilon_i$ attains a regret scaling of $\\widetilde{O}(\\epsilon_i\\sqrt{T})$. This demonstrates that if the user representations are close (small $\\epsilon_i$), the resulting regret is low, and vice-versa.
The results are empirically validated and we observe superior performance of our adaptive algorithms over non-adaptive baselines.", "sentences": ["Adaptive Clustering and Personalization in Multi-Agent Stochastic Linear Bandits.", "We consider the problem of minimizing regret in an $N$ agent heterogeneous stochastic linear bandits framework, where the agents (users) are similar but not all identical.", "We model user heterogeneity using two ideas popularly used in practice: (i) a clustering framework where users are partitioned into groups with users in the same group being identical to each other, but different across groups, and (ii) a personalization framework where no two users are necessarily identical, but a user's parameters are close to that of the population average.", "In the clustered users' setup, we propose a novel algorithm, based on successive refinement of cluster identities and regret minimization.", "We show that, for any agent, the regret scales as $\\mathcal{O}(\\sqrt{T/N})$, if the agent is in a `well separated' cluster, or scales as $\\mathcal{O}(T^{\\frac{1}{2} + \\varepsilon}/(N)^{\\frac{1}{2} -\\varepsilon})$ if its cluster is not well separated, where $\\varepsilon$ is positive and arbitrarily close to $0$.", "Our algorithm is adaptive to the cluster separation, and is parameter free -- it does not need to know the number of clusters, separation and cluster size, yet the regret guarantee adapts to the inherent complexity.", "In the personalization framework, we introduce a natural algorithm where the personal bandit instances are initialized with the estimates of the global average model.", "We show that an agent $i$ whose parameter deviates from the population average by $\\epsilon_i$ attains a regret scaling of $\\widetilde{O}(\\epsilon_i\\sqrt{T})$.", "This demonstrates that if the user representations are close (small $\\epsilon_i$), the resulting regret is low, and vice-versa.", "The results are empirically validated and we observe superior performance of our adaptive algorithms over non-adaptive baselines."]} {"id": "http://arxiv.org/abs/2106.10466", "title": "TS2Vec: Towards Universal Representation of Time Series.", "authors": "Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, Bixiong Xu", "abstract": "This paper presents TS2Vec, a universal framework for learning representations of time series at an arbitrary semantic level. Unlike existing methods, TS2Vec performs contrastive learning in a hierarchical way over augmented context views, which enables a robust contextual representation for each timestamp. Furthermore, to obtain the representation of an arbitrary sub-sequence in the time series, we can apply a simple aggregation over the representations of corresponding timestamps. We conduct extensive experiments on time series classification tasks to evaluate the quality of time series representations. As a result, TS2Vec achieves significant improvement over existing SOTAs of unsupervised time series representation on 125 UCR datasets and 29 UEA datasets. The learned timestamp-level representations also achieve superior results in time series forecasting and anomaly detection tasks. A linear regression trained on top of the learned representations outperforms previous SOTAs of time series forecasting. Furthermore, we present a simple way to apply the learned representations for unsupervised anomaly detection, which establishes SOTA results in the literature.
The source code is publicly available at https://github.com/yuezhihan/ts2vec.", "sentences": ["TS2Vec: Towards Universal Representation of Time Series.", "This paper presents TS2Vec, a universal framework for learning representations of time series at an arbitrary semantic level.", "Unlike existing methods, TS2Vec performs contrastive learning in a hierarchical way over augmented context views, which enables a robust contextual representation for each timestamp.", "Furthermore, to obtain the representation of an arbitrary sub-sequence in the time series, we can apply a simple aggregation over the representations of corresponding timestamps.", "We conduct extensive experiments on time series classification tasks to evaluate the quality of time series representations.", "As a result, TS2Vec achieves significant improvement over existing SOTAs of unsupervised time series representation on 125 UCR datasets and 29 UEA datasets.", "The learned timestamp-level representations also achieve superior results in time series forecasting and anomaly detection tasks.", "A linear regression trained on top of the learned representations outperforms previous SOTAs of time series forecasting.", "Furthermore, we present a simple way to apply the learned representations for unsupervised anomaly detection, which establishes SOTA results in the literature.", "The source code is publicly available at https://github.com/yuezhihan/ts2vec."]} {"id": "http://arxiv.org/abs/2106.10771", "title": "Multirate Training of Neural Networks.", "authors": "Tiffany Vlaar, Benedict Leimkuhler", "abstract": "We propose multirate training of neural networks: partitioning neural network parameters into \"fast\" and \"slow\" parts which are trained on different time scales. By choosing appropriate partitionings we can obtain substantial computational speed-up for transfer learning tasks. We show for applications in vision and NLP that we can fine-tune deep neural networks in almost half the time, without reducing the generalization performance of the resulting models. We analyze the convergence properties of our multirate scheme and draw a comparison with vanilla SGD. We also discuss splitting choices for the neural network parameters which could enhance generalization performance when neural networks are trained from scratch. A multirate approach can be used to learn different features present in the data and as a form of regularization.
Our paper unlocks the potential of using multirate techniques for neural network training and provides several starting points for future work in this area.", "sentences": ["Multirate Training of Neural Networks.", "We propose multirate training of neural networks: partitioning neural network parameters into \"fast\" and \"slow\" parts which are trained on different time scales.", "By choosing appropriate partitionings we can obtain substantial computational speed-up for transfer learning tasks.", "We show for applications in vision and NLP that we can fine-tune deep neural networks in almost half the time, without reducing the generalization performance of the resulting models.", "We analyze the convergence properties of our multirate scheme and draw a comparison with vanilla SGD.", "We also discuss splitting choices for the neural network parameters which could enhance generalization performance when neural networks are trained from scratch.", "A multirate approach can be used to learn different features present in the data and as a form of regularization.", "Our paper unlocks the potential of using multirate techniques for neural network training and provides several starting points for future work in this area."]} {"id": "http://arxiv.org/abs/2106.16004", "title": "What can linear interpolation of neural network loss landscapes tell us?.", "authors": "Tiffany Vlaar, Jonathan Frankle", "abstract": "Studying neural network loss landscapes provides insights into the nature of the underlying optimization problems. Unfortunately, loss landscapes are notoriously difficult to visualize in a human-comprehensible fashion. One common way to address this problem is to plot linear slices of the landscape, for example from the initial state of the network to the final state after optimization. On the basis of this analysis, prior work has drawn broader conclusions about the difficulty of the optimization problem. In this paper, we put inferences of this kind to the test, systematically evaluating how linear interpolation and final performance vary when altering the data, choice of initialization, and other optimizer and architecture design choices. Further, we use linear interpolation to study the role played by individual layers and substructures of the network. We find that certain layers are more sensitive to the choice of initialization, but that the shape of the linear path is not indicative of the changes in test accuracy of the model. 
Our results cast doubt on the broader intuition that the presence or absence of barriers when interpolating necessarily relates to the success of optimization.", "sentences": ["What can linear interpolation of neural network loss landscapes tell us?.", "Studying neural network loss landscapes provides insights into the nature of the underlying optimization problems.", "Unfortunately, loss landscapes are notoriously difficult to visualize in a human-comprehensible fashion.", "One common way to address this problem is to plot linear slices of the landscape, for example from the initial state of the network to the final state after optimization.", "On the basis of this analysis, prior work has drawn broader conclusions about the difficulty of the optimization problem.", "In this paper, we put inferences of this kind to the test, systematically evaluating how linear interpolation and final performance vary when altering the data, choice of initialization, and other optimizer and architecture design choices.", "Further, we use linear interpolation to study the role played by individual layers and substructures of the network.", "We find that certain layers are more sensitive to the choice of initialization, but that the shape of the linear path is not indicative of the changes in test accuracy of the model.", "Our results cast doubt on the broader intuition that the presence or absence of barriers when interpolating necessarily relates to the success of optimization."]} {"id": "http://arxiv.org/abs/2107.03311", "title": "RoFL: Attestable Robustness for Secure Federated Learning.", "authors": "Lukas Burkhalter, Hidde Lycklama, Alexander Viand, Nicolas K\u00fcchler, Anwar Hithnawi", "abstract": "Even though recent years have seen many attacks exposing severe vulnerabilities in federated learning (FL), a holistic understanding of what enables these attacks and how they can be mitigated effectively is still lacking. In this work we demystify the inner workings of existing targeted attacks. We provide new insights into why these attacks are possible and why a definitive solution to FL robustness is challenging. We show that the need for ML algorithms to memorize tail data has significant implications for FL integrity. This phenomenon has largely been studied in the context of privacy; our analysis sheds light on its implications for ML integrity. In addition, we show how constraints on client updates can effectively improve robustness. To incorporate these constraints into secure FL protocols, we design and develop RoFL, a new secure FL system that enables constraints to be expressed and enforced on high-dimensional encrypted model updates. In essence, RoFL augments existing secure FL aggregation protocols with zero-knowledge proofs. Due to the scale of FL, realizing these checks efficiently presents a paramount challenge. We introduce several optimizations at the ML layer that allow us to reduce the number of cryptographic checks needed while preserving the effectiveness of our defenses. 
We show that RoFL scales to the sizes of models used in real-world FL deployments.", "sentences": ["RoFL: Attestable Robustness for Secure Federated Learning.", "Even though recent years have seen many attacks exposing severe vulnerabilities in federated learning (FL), a holistic understanding of what enables these attacks and how they can be mitigated effectively is still lacking.", "In this work we demystify the inner workings of existing targeted attacks.", "We provide new insights into why these attacks are possible and why a definitive solution to FL robustness is challenging.", "We show that the need for ML algorithms to memorize tail data has significant implications for FL integrity.", "This phenomenon has largely been studied in the context of privacy; our analysis sheds light on its implications for ML integrity.", "In addition, we show how constraints on client updates can effectively improve robustness.", "To incorporate these constraints into secure FL protocols, we design and develop RoFL, a new secure FL system that enables constraints to be expressed and enforced on high-dimensional encrypted model updates.", "In essence, RoFL augments existing secure FL aggregation protocols with zero-knowledge proofs.", "Due to the scale of FL, realizing these checks efficiently presents a paramount challenge.", "We introduce several optimizations at the ML layer that allow us to reduce the number of cryptographic checks needed while preserving the effectiveness of our defenses.", "We show that RoFL scales to the sizes of models used in real-world FL deployments."]} {"id": "http://arxiv.org/abs/2107.05802", "title": "How many degrees of freedom do we need to train deep networks: a loss landscape perspective.", "authors": "Brett W. Larsen, Stanislav Fort, Nic Becker, Surya Ganguli", "abstract": "A variety of recent works, spanning pruning, lottery tickets, and training within random subspaces, have shown that deep neural networks can be trained using far fewer degrees of freedom than the total number of parameters. We analyze this phenomenon for random subspaces by first examining the success probability of hitting a training loss sub-level set when training within a random subspace of a given training dimensionality. We find a sharp phase transition in the success probability from $0$ to $1$ as the training dimension surpasses a threshold. This threshold training dimension increases as the desired final loss decreases, but decreases as the initial loss decreases. We then theoretically explain the origin of this phase transition, and its dependence on initialization and final desired loss, in terms of properties of the high-dimensional geometry of the loss landscape. In particular, we show, via Gordon's escape theorem, that the training dimension plus the Gaussian width of the desired loss sub-level set, projected onto a unit sphere surrounding the initialization, must exceed the total number of parameters for the success probability to be large. In several architectures and datasets, we measure the threshold training dimension as a function of initialization and demonstrate that it is a small fraction of the total parameters, implying by our theory that successful training with so few dimensions is possible precisely because the Gaussian width of low loss sub-level sets is very large. Moreover, we compare this threshold training dimension to more sophisticated ways of reducing training degrees of freedom, including lottery tickets as well as a new, analogous method: lottery subspaces.
Code is available at https://github.com/ganguli-lab/degrees-of-freedom.", "sentences": ["How many degrees of freedom do we need to train deep networks: a loss landscape perspective.", "A variety of recent works, spanning pruning, lottery tickets, and training within random subspaces, have shown that deep neural networks can be trained using far fewer degrees of freedom than the total number of parameters.", "We analyze this phenomenon for random subspaces by first examining the success probability of hitting a training loss sub-level set when training within a random subspace of a given training dimensionality.", "We find a sharp phase transition in the success probability from $0$ to $1$ as the training dimension surpasses a threshold.", "This threshold training dimension increases as the desired final loss decreases, but decreases as the initial loss decreases.", "We then theoretically explain the origin of this phase transition, and its dependence on initialization and final desired loss, in terms of properties of the high-dimensional geometry of the loss landscape.", "In particular, we show, via Gordon's escape theorem, that the training dimension plus the Gaussian width of the desired loss sub-level set, projected onto a unit sphere surrounding the initialization, must exceed the total number of parameters for the success probability to be large.", "In several architectures and datasets, we measure the threshold training dimension as a function of initialization and demonstrate that it is a small fraction of the total parameters, implying by our theory that successful training with so few dimensions is possible precisely because the Gaussian width of low loss sub-level sets is very large.", "Moreover, we compare this threshold training dimension to more sophisticated ways of reducing training degrees of freedom, including lottery tickets as well as a new, analogous method: lottery subspaces.", "Code is available at https://github.com/ganguli-lab/degrees-of-freedom."]} {"id": "http://arxiv.org/abs/2107.09110", "title": "OnlineSTL: Scaling Time Series Decomposition by 100x.", "authors": "Abhinav Mishra, Ram Sriharsha, Sichen Zhong", "abstract": "Decomposing a complex time series into trend, seasonality, and remainder components is an important primitive that facilitates time series anomaly detection, change point detection, and forecasting. Although numerous batch algorithms are known for time series decomposition, none operate well in an online scalable setting where high throughput and real-time response are paramount. In this paper, we propose OnlineSTL, a novel online algorithm for time series decomposition which is highly scalable and is deployed for real-time metrics monitoring on high-resolution, high-ingest rate data.
Experiments on different synthetic and real-world time series datasets demonstrate that OnlineSTL achieves orders of magnitude speedups (100x) while maintaining the quality of the decomposition.", "sentences": ["OnlineSTL: Scaling Time Series Decomposition by 100x.", "Decomposing a complex time series into trend, seasonality, and remainder components is an important primitive that facilitates time series anomaly detection, change point detection, and forecasting.", "Although numerous batch algorithms are known for time series decomposition, none operate well in an online scalable setting where high throughput and real-time response are paramount.", "In this paper, we propose OnlineSTL, a novel online algorithm for time series decomposition which is highly scalable and is deployed for real-time metrics monitoring on high-resolution, high-ingest rate data.", "Experiments on different synthetic and real-world time series datasets demonstrate that OnlineSTL achieves orders of magnitude speedups (100x) while maintaining the quality of the decomposition."]} {"id": "http://arxiv.org/abs/2108.08230", "title": "Predicting Dynamic Stability of Power Grids using Graph Neural Networks.", "authors": "Christian Nauck, Michael Lindner, Konstantin Sch\u00fcrholt, Haoming Zhang, Paul Schultz, J\u00fcrgen Kurths, Ingrid Isenhardt, Frank Hellmann", "abstract": "The prediction of dynamical stability of power grids becomes more important and challenging with increasing shares of renewable energy sources due to their decentralized structure, reduced inertia and volatility. We investigate the feasibility of applying graph neural networks (GNN) to predict dynamic stability of synchronisation in complex power grids using the single-node basin stability (SNBS) as a measure. To do so, we generate two synthetic datasets for grids with 20 and 100 nodes respectively and estimate SNBS using Monte-Carlo sampling. Those datasets are used to train and evaluate the performance of eight different GNN-models. All models use the full graph without simplifications as input and predict SNBS in a nodal-regression-setup. We show that SNBS can be predicted in general and the performance significantly changes using different GNN-models.
Furthermore, we observe interesting transfer capabilities of our approach: GNN-models trained on smaller grids can directly be applied on larger grids without the need of retraining.", "sentences": ["Predicting Dynamic Stability of Power Grids using Graph Neural Networks.", "The prediction of dynamical stability of power grids becomes more important and challenging with increasing shares of renewable energy sources due to their decentralized structure, reduced inertia and volatility.", "We investigate the feasibility of applying graph neural networks (GNN) to predict dynamic stability of synchronisation in complex power grids using the single-node basin stability (SNBS) as a measure.", "To do so, we generate two synthetic datasets for grids with 20 and 100 nodes respectively and estimate SNBS using Monte-Carlo sampling.", "Those datasets are used to train and evaluate the performance of eight different GNN-models.", "All models use the full graph without simplifications as input and predict SNBS in a nodal-regression-setup.", "We show that SNBS can be predicted in general and the performance significantly changes using different GNN-models.", "Furthermore, we observe interesting transfer capabilities of our approach: GNN-models trained on smaller grids can directly be applied on larger grids without the need of retraining."]} {"id": "http://arxiv.org/abs/2109.01262", "title": "On the Accuracy of Analog Neural Network Inference Accelerators.", "authors": "T. Patrick Xiao, Ben Feinberg, Christopher H. Bennett, Venkatraman Prabhakar, Prashant Saxena, Vineet Agrawal, Sapan Agarwal, Matthew J. Marinella", "abstract": "Specialized accelerators have recently garnered attention as a method to reduce the power consumption of neural network inference. A promising category of accelerators utilizes nonvolatile memory arrays to both store weights and perform $\\textit{in situ}$ analog computation inside the array. While prior work has explored the design space of analog accelerators to optimize performance and energy efficiency, there is seldom a rigorous evaluation of the accuracy of these accelerators. This work shows how architectural design decisions, particularly in mapping neural network parameters to analog memory cells, influence inference accuracy. When evaluated using ResNet50 on ImageNet, the resilience of the system to analog non-idealities - cell programming errors, analog-to-digital converter resolution, and array parasitic resistances - all improve when analog quantities in the hardware are made proportional to the weights in the network. Moreover, contrary to the assumptions of prior work, nearly equivalent resilience to cell imprecision can be achieved by fully storing weights as analog quantities, rather than spreading weight bits across multiple devices, often referred to as bit slicing. By exploiting proportionality, analog system designers have the freedom to match the precision of the hardware to the needs of the algorithm, rather than attempting to guarantee the same level of precision in the intermediate results as an equivalent digital accelerator. 
This ultimately results in an analog accelerator that is more accurate, more robust to analog errors, and more energy-efficient.", "sentences": ["On the Accuracy of Analog Neural Network Inference Accelerators.", "Specialized accelerators have recently garnered attention as a method to reduce the power consumption of neural network inference.", "A promising category of accelerators utilizes nonvolatile memory arrays to both store weights and perform $\\textit{in situ}$ analog computation inside the array.", "While prior work has explored the design space of analog accelerators to optimize performance and energy efficiency, there is seldom a rigorous evaluation of the accuracy of these accelerators.", "This work shows how architectural design decisions, particularly in mapping neural network parameters to analog memory cells, influence inference accuracy.", "When evaluated using ResNet50 on ImageNet, the resilience of the system to analog non-idealities - cell programming errors, analog-to-digital converter resolution, and array parasitic resistances - all improve when analog quantities in the hardware are made proportional to the weights in the network.", "Moreover, contrary to the assumptions of prior work, nearly equivalent resilience to cell imprecision can be achieved by fully storing weights as analog quantities, rather than spreading weight bits across multiple devices, often referred to as bit slicing.", "By exploiting proportionality, analog system designers have the freedom to match the precision of the hardware to the needs of the algorithm, rather than attempting to guarantee the same level of precision in the intermediate results as an equivalent digital accelerator.", "This ultimately results in an analog accelerator that is more accurate, more robust to analog errors, and more energy-efficient."]} {"id": "http://arxiv.org/abs/2109.03699", "title": "Sample and Communication-Efficient Decentralized Actor-Critic Algorithms with Finite-Time Analysis.", "authors": "Ziyi Chen, Yi Zhou, Rongrong Chen, Shaofeng Zou", "abstract": "Actor-critic (AC) algorithms have been widely adopted in decentralized multi-agent systems to learn the optimal joint control policy. However, existing decentralized AC algorithms either do not preserve the privacy of agents or are not sample and communication-efficient. In this work, we develop two decentralized AC and natural AC (NAC) algorithms that are private, and sample and communication-efficient. In both algorithms, agents share noisy information to preserve privacy and adopt mini-batch updates to improve sample and communication efficiency. Particularly for decentralized NAC, we develop a decentralized Markovian SGD algorithm with an adaptive mini-batch size to efficiently compute the natural policy gradient. Under Markovian sampling and linear function approximation, we prove the proposed decentralized AC and NAC algorithms achieve the state-of-the-art sample complexities $\\mathcal{O}\\big(\\epsilon^{-2}\\ln(\\epsilon^{-1})\\big)$ and $\\mathcal{O}\\big(\\epsilon^{-3}\\ln(\\epsilon^{-1})\\big)$, respectively, and the same small communication complexity $\\mathcal{O}\\big(\\epsilon^{-1}\\ln(\\epsilon^{-1})\\big)$. 
Numerical experiments demonstrate that the proposed algorithms achieve lower sample and communication complexities than the existing decentralized AC algorithm.", "sentences": ["Sample and Communication-Efficient Decentralized Actor-Critic Algorithms with Finite-Time Analysis.", "Actor-critic (AC) algorithms have been widely adopted in decentralized multi-agent systems to learn the optimal joint control policy.", "However, existing decentralized AC algorithms either do not preserve the privacy of agents or are not sample and communication-efficient.", "In this work, we develop two decentralized AC and natural AC (NAC) algorithms that are private, and sample and communication-efficient.", "In both algorithms, agents share noisy information to preserve privacy and adopt mini-batch updates to improve sample and communication efficiency.", "Particularly for decentralized NAC, we develop a decentralized Markovian SGD algorithm with an adaptive mini-batch size to efficiently compute the natural policy gradient.", "Under Markovian sampling and linear function approximation, we prove the proposed decentralized AC and NAC algorithms achieve the state-of-the-art sample complexities $\\mathcal{O}\\big(\\epsilon^{-2}\\ln(\\epsilon^{-1})\\big)$ and $\\mathcal{O}\\big(\\epsilon^{-3}\\ln(\\epsilon^{-1})\\big)$, respectively, and the same small communication complexity $\\mathcal{O}\\big(\\epsilon^{-1}\\ln(\\epsilon^{-1})\\big)$.", "Numerical experiments demonstrate that the proposed algorithms achieve lower sample and communication complexities than the existing decentralized AC algorithm."]} {"id": "http://arxiv.org/abs/2109.06715", "title": "IGNNITION: Bridging the Gap Between Graph Neural Networks and Networking Systems.", "authors": "David Pujol-Perich, Jos\u00e9 Su\u00e1rez-Varela, Miquel Ferriol, Shihan Xiao, Bo Wu, Albert Cabellos-Aparicio, Pere Barlet-Ros", "abstract": "Recent years have seen the vast potential of Graph Neural Networks (GNN) in many fields where data is structured as graphs (e.g., chemistry, recommender systems). In particular, GNNs are becoming increasingly popular in the field of networking, as graphs are intrinsically present at many levels (e.g., topology, routing). The main novelty of GNNs is their ability to generalize to other networks unseen during training, which is an essential feature for developing practical Machine Learning (ML) solutions for networking. However, implementing a functional GNN prototype is currently a cumbersome task that requires strong skills in neural network programming. This poses an important barrier to network engineers that often do not have the necessary ML expertise. In this article, we present IGNNITION, a novel open-source framework that enables fast prototyping of GNNs for networking systems. IGNNITION is based on an intuitive high-level abstraction that hides the complexity behind GNNs, while still offering great flexibility to build custom GNN architectures. To showcase the versatility and performance of this framework, we implement two state-of-the-art GNN models applied to different networking use cases. 
Our results show that the GNN models produced by IGNNITION are equivalent in terms of accuracy and performance to their native implementations in TensorFlow.", "sentences": ["IGNNITION: Bridging the Gap Between Graph Neural Networks and Networking Systems.", "Recent years have seen the vast potential of Graph Neural Networks (GNN) in many fields where data is structured as graphs (e.g., chemistry, recommender systems).", "In particular, GNNs are becoming increasingly popular in the field of networking, as graphs are intrinsically present at many levels (e.g., topology, routing).", "The main novelty of GNNs is their ability to generalize to other networks unseen during training, which is an essential feature for developing practical Machine Learning (ML) solutions for networking.", "However, implementing a functional GNN prototype is currently a cumbersome task that requires strong skills in neural network programming.", "This poses an important barrier to network engineers that often do not have the necessary ML expertise.", "In this article, we present IGNNITION, a novel open-source framework that enables fast prototyping of GNNs for networking systems.", "IGNNITION is based on an intuitive high-level abstraction that hides the complexity behind GNNs, while still offering great flexibility to build custom GNN architectures.", "To showcase the versatility and performance of this framework, we implement two state-of-the-art GNN models applied to different networking use cases.", "Our results show that the GNN models produced by IGNNITION are equivalent in terms of accuracy and performance to their native implementations in TensorFlow."]} {"id": "http://arxiv.org/abs/2109.14569", "title": "Partitioning Cloud-based Microservices (via Deep Learning).", "authors": "Rahul Yedida, Rahul Krishna, Anup Kalia, Tim Menzies, Jin Xiao, Maja Vukovic", "abstract": "Cloud-based software has many advantages. When services are divided into many independent components, they are easier to update. Also, during peak demand, it is easier to scale cloud services (just hire more CPUs). Hence, many organizations are partitioning their monolithic enterprise applications into cloud-based microservices. Recently there has been much work using machine learning to simplify this partitioning task. Despite much research, no single partitioning method can be recommended as generally useful. More specifically, those prior solutions are \"brittle\"; i.e. if they work well for one kind of goal in one dataset, then they can be sub-optimal if applied to many datasets and multiple goals. In order to find a generally useful partitioning method, we propose DEEPLY. This new algorithm extends the CO-GCN deep learning partition generator with (a) a novel loss function and (b) some hyper-parameter optimization. As shown by our experiments, DEEPLY generally outperforms prior work (including CO-GCN, and others) across multiple datasets and goals. To the best of our knowledge, this is the first report in SE of such stable hyper-parameter optimization. 
To aid reuse of this work, DEEPLY is available on-line at https://bit.ly/2WhfFlB.", "sentences": ["Partitioning Cloud-based Microservices (via Deep Learning).", "Cloud-based software has many advantages.", "When services are divided into many independent components, they are easier to update.", "Also, during peak demand, it is easier to scale cloud services (just hire more CPUs).", "Hence, many organizations are partitioning their monolithic enterprise applications into cloud-based microservices.", "Recently there has been much work using machine learning to simplify this partitioning task.", "Despite much research, no single partitioning method can be recommended as generally useful.", "More specifically, those prior solutions are \"brittle\"; i.e.", "if they work well for one kind of goal in one dataset, then they can be sub-optimal if applied to many datasets and multiple goals.", "In order to find a generally useful partitioning method, we propose DEEPLY.", "This new algorithm extends the CO-GCN deep learning partition generator with (a) a novel loss function and (b) some hyper-parameter optimization.", "As shown by our experiments, DEEPLY generally outperforms prior work (including CO-GCN, and others) across multiple datasets and goals.", "To the best of our knowledge, this is the first report in SE of such stable hyper-parameter optimization.", "To aid reuse of this work, DEEPLY is available on-line at https://bit.ly/2WhfFlB."]} {"id": "http://arxiv.org/abs/2110.06914", "title": "What Happens after SGD Reaches Zero Loss? --A Mathematical Framework.", "authors": "Zhiyuan Li, Tianhao Wang, Sanjeev Arora", "abstract": "Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $\\eta$, SGD tracks Gradient Descent (GD) until it gets close to such manifold, where the gradient noise prevents further convergence. In such a regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of loss, $\\mathrm{tr}[\\nabla^2 L]$. The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows in principle a complete characterization for the regularization effect of SGD around such manifold -- i.e., the \"implicit bias\" -- using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for $\\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\\eta^{-1.6}$ steps and (2) allowing arbitrary noise covariance. As an application, we show with arbitrary large initialization, label noise SGD can always escape the kernel regime and only requires $O(\\kappa\\ln d)$ samples for learning an $\\kappa$-sparse overparametrized linear model in $\\mathbb{R}^d$ (Woodworth et al., 2020), while GD initialized in the kernel regime requires $\\Omega(d)$ samples. This upper bound is minimax optimal and improves the previous $\\tilde{O}(\\kappa^2)$ upper bound (HaoChen et al., 2020).", "sentences": ["What Happens after SGD Reaches Zero Loss? 
--A Mathematical Framework.", "Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold.", "Intuitively, with a sufficiently small learning rate $\\eta$, SGD tracks Gradient Descent (GD) until it gets close to such manifold, where the gradient noise prevents further convergence.", "In such a regime, Blanc et al.", "(2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of loss, $\\mathrm{tr}[\\nabla^2 L]$.", "The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991).", "It allows in principle a complete characterization for the regularization effect of SGD around such manifold -- i.e., the \"implicit bias\" -- using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance.", "This yields some new results: (1) a global analysis of the implicit bias valid for $\\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al.", "(2020) that is only valid for $\\eta^{-1.6}$ steps and (2) allowing arbitrary noise covariance.", "As an application, we show with arbitrary large initialization, label noise SGD can always escape the kernel regime and only requires $O(\\kappa\\ln d)$ samples for learning an $\\kappa$-sparse overparametrized linear model in $\\mathbb{R}^d$ (Woodworth et al., 2020), while GD initialized in the kernel regime requires $\\Omega(d)$ samples.", "This upper bound is minimax optimal and improves the previous $\\tilde{O}(\\kappa^2)$ upper bound (HaoChen et al., 2020)."]} {"id": "http://arxiv.org/abs/2110.07570", "title": "MGC: A Complex-Valued Graph Convolutional Network for Directed Graphs.", "authors": "Jie Zhang, Bo Hui, Po-Wei Harn, Min-Te Sun, Wei-Shinn Ku", "abstract": "Recent advancements in Graph Neural Networks have led to state-of-the-art performance on graph representation learning. However, the majority of existing works process directed graphs by symmetrization, which causes loss of directional information. To address this issue, we introduce the magnetic Laplacian, a discrete Schr\\\"odinger operator with magnetic field, which preserves edge directionality by encoding it into a complex phase with an electric charge parameter. By adopting a truncated variant of PageRank named Linear- Rank, we design and build a low-pass filter for homogeneous graphs and a high-pass filter for heterogeneous graphs. In this work, we propose a complex-valued graph convolutional network named Magnetic Graph Convolutional network (MGC). With the corresponding complex-valued techniques, we ensure our model will be degenerated into real-valued when the charge parameter is in specific values. We test our model on several graph datasets including directed homogeneous and heterogeneous graphs. 
The experimental results demonstrate that MGC is fast, powerful, and widely applicable.", "sentences": ["MGC: A Complex-Valued Graph Convolutional Network for Directed Graphs.", "Recent advancements in Graph Neural Networks have led to state-of-the-art performance on graph representation learning.", "However, the majority of existing works process directed graphs by symmetrization, which causes loss of directional information.", "To address this issue, we introduce the magnetic Laplacian, a discrete Schr\\\"odinger operator with magnetic field, which preserves edge directionality by encoding it into a complex phase with an electric charge parameter.", "By adopting a truncated variant of PageRank named Linear- Rank, we design and build a low-pass filter for homogeneous graphs and a high-pass filter for heterogeneous graphs.", "In this work, we propose a complex-valued graph convolutional network named Magnetic Graph Convolutional network (MGC).", "With the corresponding complex-valued techniques, we ensure our model will be degenerated into real-valued when the charge parameter is in specific values.", "We test our model on several graph datasets including directed homogeneous and heterogeneous graphs.", "The experimental results demonstrate that MGC is fast, powerful, and widely applicable."]} {"id": "http://arxiv.org/abs/2110.07683", "title": "Toward Realistic Backdoor Injection Attacks on DNNs using Rowhammer.", "authors": "M. Caner Tol, Saad Islam, Berk Sunar, Ziming Zhang", "abstract": "State-of-the-art deep neural networks (DNNs) have been proven to be vulnerable to adversarial manipulation and backdoor attacks. Backdoored models deviate from expected behavior on inputs with predefined triggers while retaining performance on clean data. Recent works focus on software simulation of backdoor injection during the inference phase by modifying network weights, which we find often unrealistic in practice due to restrictions in hardware. In contrast, in this work for the first time we present an end-to-end backdoor injection attack realized on actual hardware on a classifier model using Rowhammer as the fault injection method. To this end, we first investigate the viability of backdoor injection attacks in real-life deployments of DNNs on hardware and address such practical issues in hardware implementation from a novel optimization perspective. We are motivated by the fact that the vulnerable memory locations are very rare, device-specific, and sparsely distributed. Consequently, we propose a novel network training algorithm based on constrained optimization to achieve a realistic backdoor injection attack in hardware. By modifying parameters uniformly across the convolutional and fully-connected layers as well as optimizing the trigger pattern together, we achieve the state-of-the-art attack performance with fewer bit flips. 
For instance, our method on a hardware-deployed ResNet-20 model trained on CIFAR-10 achieves over 91% test accuracy and 94% attack success rate by flipping only 10 out of 2.2 million bits.", "sentences": ["Toward Realistic Backdoor Injection Attacks on DNNs using Rowhammer.", "State-of-the-art deep neural networks (DNNs) have been proven to be vulnerable to adversarial manipulation and backdoor attacks.", "Backdoored models deviate from expected behavior on inputs with predefined triggers while retaining performance on clean data.", "Recent works focus on software simulation of backdoor injection during the inference phase by modifying network weights, which we find often unrealistic in practice due to restrictions in hardware.", "In contrast, in this work for the first time we present an end-to-end backdoor injection attack realized on actual hardware on a classifier model using Rowhammer as the fault injection method.", "To this end, we first investigate the viability of backdoor injection attacks in real-life deployments of DNNs on hardware and address such practical issues in hardware implementation from a novel optimization perspective.", "We are motivated by the fact that the vulnerable memory locations are very rare, device-specific, and sparsely distributed.", "Consequently, we propose a novel network training algorithm based on constrained optimization to achieve a realistic backdoor injection attack in hardware.", "By modifying parameters uniformly across the convolutional and fully-connected layers as well as optimizing the trigger pattern together, we achieve the state-of-the-art attack performance with fewer bit flips.", "For instance, our method on a hardware-deployed ResNet-20 model trained on CIFAR-10 achieves over 91% test accuracy and 94% attack success rate by flipping only 10 out of 2.2 million bits."]} {"id": "http://arxiv.org/abs/2110.09348", "title": "Understanding Dimensional Collapse in Contrastive Self-supervised Learning.", "authors": "Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian", "abstract": "Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. Joint embedding approach bases on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem where all embedding vectors collapse to a trivial constant solution. Among these methods, contrastive learning prevents collapse via negative sample pairs. It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space. Here, we show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that leads to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector. 
Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.", "sentences": ["Understanding Dimensional Collapse in Contrastive Self-supervised Learning.", "Self-supervised visual representation learning aims to learn useful representations without relying on human annotations.", "Joint embedding approach bases on maximizing the agreement between embedding vectors from different views of the same image.", "Various methods have been proposed to solve the collapsing problem where all embedding vectors collapse to a trivial constant solution.", "Among these methods, contrastive learning prevents collapse via negative sample pairs.", "It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space.", "Here, we show that dimensional collapse also happens in contrastive learning.", "In this paper, we shed light on the dynamics at play in contrastive learning that leads to dimensional collapse.", "Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector.", "Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet."]} {"id": "http://arxiv.org/abs/2110.12087", "title": "Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds.", "authors": "Vu Nguyen, Marc Peter Deisenroth, Michael A. Osborne", "abstract": "Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP model satisfying the given bounds, and then sample and weight functions from its posterior. To further exploit these bounds in BO settings, we present bounded entropy search (BES) to select the point gaining the most information about the underlying function, estimated by the GP samples, while satisfying the output constraints. We characterize the sample variance bounds and show that the decision made by BES is explainable. 
Our proposed approach is conceptually straightforward and can be used as a plug-in extension to existing methods for GP posterior sampling and Bayesian optimization.", "sentences": ["Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds.", "Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions.", "In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known.", "More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO).", "That is, we transform a GP model satisfying the given bounds, and then sample and weight functions from its posterior.", "To further exploit these bounds in BO settings, we present bounded entropy search (BES) to select the point gaining the most information about the underlying function, estimated by the GP samples, while satisfying the output constraints.", "We characterize the sample variance bounds and show that the decision made by BES is explainable.", "Our proposed approach is conceptually straightforward and can be used as a plug-in extension to existing methods for GP posterior sampling and Bayesian optimization."]} {"id": "http://arxiv.org/abs/2110.13970", "title": "Rademacher Random Projections with Tensor Networks.", "authors": "Beheshteh T. Rakhshan, Guillaume Rabusseau", "abstract": "Random projections (RP) have recently emerged as popular techniques in the machine learning community for their ability to reduce the dimension of very high-dimensional tensors. Following the work in [30], we consider a tensorized random projection relying on Tensor Train (TT) decomposition where each element of the core tensors is drawn from a Rademacher distribution. Our theoretical results reveal that the Gaussian low-rank tensor represented in compressed form in TT format in [30] can be replaced by a TT tensor with core elements drawn from a Rademacher distribution with the same embedding size. Experiments on synthetic data demonstrate that tensorized Rademacher RP can outperform the tensorized Gaussian RP studied in [30]. 
In addition, we show both theoretically and experimentally that the tensorized RP in the Matrix Product Operator (MPO) format is not a Johnson-Lindenstrauss transform (JLT) and therefore not a well-suited random projection map.", "sentences": ["Rademacher Random Projections with Tensor Networks.", "Random projections (RP) have recently emerged as popular techniques in the machine learning community for their ability to reduce the dimension of very high-dimensional tensors.", "Following the work in [30], we consider a tensorized random projection relying on Tensor Train (TT) decomposition where each element of the core tensors is drawn from a Rademacher distribution.", "Our theoretical results reveal that the Gaussian low-rank tensor represented in compressed form in TT format in [30] can be replaced by a TT tensor with core elements drawn from a Rademacher distribution with the same embedding size.", "Experiments on synthetic data demonstrate that tensorized Rademacher RP can outperform the tensorized Gaussian RP studied in [30].", "In addition, we show both theoretically and experimentally that the tensorized RP in the Matrix Product Operator (MPO) format is not a Johnson-Lindenstrauss transform (JLT) and therefore not a well-suited random projection map."]} {"id": "http://arxiv.org/abs/2110.14068", "title": "Drawing Robust Scratch Tickets: Subnetworks with Inborn Robustness Are Found within Randomly Initialized Networks.", "authors": "Yonggan Fu, Qixuan Yu, Yang Zhang, Shang Wu, Xu Ouyang, David Cox, Yingyan Lin", "abstract": "Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks, i.e., an imperceptible perturbation to the input can mislead DNNs trained on clean images into making erroneous predictions. To tackle this, adversarial training is currently the most effective defense method, by augmenting the training set with adversarial samples generated on the fly. Interestingly, we discover for the first time that there exist subnetworks with inborn robustness, matching or surpassing the robust accuracy of the adversarially trained networks with comparable model sizes, within randomly initialized networks without any model training, indicating that adversarial training on model weights is not indispensable towards adversarial robustness. We name such subnetworks Robust Scratch Tickets (RSTs), which are also by nature efficient. Distinct from the popular lottery ticket hypothesis, neither the original dense networks nor the identified RSTs need to be trained. To validate and understand this fascinating finding, we further conduct extensive experiments to study the existence and properties of RSTs under different models, datasets, sparsity patterns, and attacks, drawing insights regarding the relationship between DNNs' robustness and their initialization/overparameterization. Furthermore, we identify the poor adversarial transferability between RSTs of different sparsity ratios drawn from the same randomly initialized dense network, and propose a Random RST Switch (R2S) technique, which randomly switches between different RSTs, as a novel defense method built on top of RSTs. 
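For the Rademacher Random Projections entry above (2110.13970), the dense (non-tensorized) Rademacher map underlying the paper is easy to sketch; the TT-format compression that is the paper's actual contribution is omitted here. A NumPy sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_rp(X, k):
    # Entries of S are +/-1 with equal probability; the 1/sqrt(k) scaling
    # gives E[<Sx, Sy>] = <x, y>, the Johnson-Lindenstrauss property.
    S = rng.choice([-1.0, 1.0], size=(X.shape[1], k))
    return X @ S / np.sqrt(k)

X = rng.standard_normal((100, 10_000))
Y = rademacher_rp(X, k=1024)
# inner products are preserved up to small distortion
rel_err = np.abs(Y @ Y.T - X @ X.T).max() / np.abs(X @ X.T).max()
print(rel_err)
```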
We believe our findings about RSTs have opened up a new perspective to study model robustness and extend the lottery ticket hypothesis.", "sentences": ["Drawing Robust Scratch Tickets: Subnetworks with Inborn Robustness Are Found within Randomly Initialized Networks.", "Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks, i.e., an imperceptible perturbation to the input can mislead DNNs trained on clean images into making erroneous predictions.", "To tackle this, adversarial training is currently the most effective defense method, by augmenting the training set with adversarial samples generated on the fly.", "Interestingly, we discover for the first time that there exist subnetworks with inborn robustness, matching or surpassing the robust accuracy of the adversarially trained networks with comparable model sizes, within randomly initialized networks without any model training, indicating that adversarial training on model weights is not indispensable towards adversarial robustness.", "We name such subnetworks Robust Scratch Tickets (RSTs), which are also by nature efficient.", "Distinct from the popular lottery ticket hypothesis, neither the original dense networks nor the identified RSTs need to be trained.", "To validate and understand this fascinating finding, we further conduct extensive experiments to study the existence and properties of RSTs under different models, datasets, sparsity patterns, and attacks, drawing insights regarding the relationship between DNNs' robustness and their initialization/overparameterization.", "Furthermore, we identify the poor adversarial transferability between RSTs of different sparsity ratios drawn from the same randomly initialized dense network, and propose a Random RST Switch (R2S) technique, which randomly switches between different RSTs, as a novel defense method built on top of RSTs.", "We believe our findings about RSTs have opened up a new perspective to study model robustness and extend the lottery ticket hypothesis."]} {"id": "http://arxiv.org/abs/2111.01134", "title": "Comparing Bayesian Models for Organ Contouring in Head and Neck Radiotherapy.", "authors": "Prerak Mody, Nicolas Chaves-de-Plaza, Klaus Hildebrandt, Rene van Egmond, Huib de Ridder, Marius Staring", "abstract": "Deep learning models for organ contouring in radiotherapy are poised for clinical usage, but currently, there exist few tools for automated quality assessment (QA) of the predicted contours. Using Bayesian models and their associated uncertainty, one can potentially automate the process of detecting inaccurate predictions. We investigate two Bayesian models for auto-contouring, DropOut and FlipOut, using a quantitative measure - expected calibration error (ECE) and a qualitative measure - region-based accuracy-vs-uncertainty (R-AvU) graphs. It is well understood that a model should have low ECE to be considered trustworthy. However, in a QA context, a model should also have high uncertainty in inaccurate regions and low uncertainty in accurate regions. Such behaviour could direct visual attention of expert users to potentially inaccurate regions, leading to a speed-up in the QA process. Using R-AvU graphs, we qualitatively compare the behaviour of different models in accurate and inaccurate regions. Experiments are conducted on the MICCAI2015 Head and Neck Segmentation Challenge and on the DeepMindTCIA CT dataset using three models: DropOut-DICE, DropOut-CE (Cross Entropy) and FlipOut-CE. 
Quantitative results show that DropOut-DICE has the highest ECE, while DropOut-CE and FlipOut-CE have the lowest ECE. To better understand the difference between DropOut-CE and FlipOut-CE, we use the R-AvU graph which shows that FlipOut-CE has better uncertainty coverage in inaccurate regions than DropOut-CE. Such a combination of quantitative and qualitative metrics explores a new approach that helps to select which model can be deployed as a QA tool in clinical settings.", "sentences": ["Comparing Bayesian Models for Organ Contouring in Head and Neck Radiotherapy.", "Deep learning models for organ contouring in radiotherapy are poised for clinical usage, but currently, there exist few tools for automated quality assessment (QA) of the predicted contours.", "Using Bayesian models and their associated uncertainty, one can potentially automate the process of detecting inaccurate predictions.", "We investigate two Bayesian models for auto-contouring, DropOut and FlipOut, using a quantitative measure - expected calibration error (ECE) and a qualitative measure - region-based accuracy-vs-uncertainty (R-AvU) graphs.", "It is well understood that a model should have low ECE to be considered trustworthy.", "However, in a QA context, a model should also have high uncertainty in inaccurate regions and low uncertainty in accurate regions.", "Such behaviour could direct visual attention of expert users to potentially inaccurate regions, leading to a speed-up in the QA process.", "Using R-AvU graphs, we qualitatively compare the behaviour of different models in accurate and inaccurate regions.", "Experiments are conducted on the MICCAI2015 Head and Neck Segmentation Challenge and on the DeepMindTCIA CT dataset using three models: DropOut-DICE, DropOut-CE (Cross Entropy) and FlipOut-CE.", "Quantitative results show that DropOut-DICE has the highest ECE, while DropOut-CE and FlipOut-CE have the lowest ECE.", "To better understand the difference between DropOut-CE and FlipOut-CE, we use the R-AvU graph which shows that FlipOut-CE has better uncertainty coverage in inaccurate regions than DropOut-CE.", "Such a combination of quantitative and qualitative metrics explores a new approach that helps to select which model can be deployed as a QA tool in clinical settings."]} {"id": "http://arxiv.org/abs/2111.08356", "title": "Inference-Time Unlabeled Personalized Federated Learning.", "authors": "Ohad Amosy, Gal Eyal, Gal Chechik", "abstract": "In Federated learning (FL), multiple clients collaborate to learn a shared model through a central server while they keep data decentralized. Personalized federated learning (PFL) further extends FL by learning personalized models per client. In both FL and PFL, all clients participate in the training process and their labeled data is used for training. However, in reality, novel clients may wish to join a prediction service after it has been deployed, obtaining predictions for their own unlabeled data. Here, we introduce a new learning setup, Inference-Time Unlabeled PFL (ITU-PFL), where a system trained on a set of clients needs to be later applied to novel unlabeled clients at inference time. We propose a novel approach to this problem, ITUFL-HN, which uses a hypernetwork to produce a new model for the late-to-the-party client. Specifically, we train an encoder network that learns a representation for a client given its unlabeled data. 
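The expected calibration error (ECE) used as the quantitative measure in the organ-contouring entry above (2111.01134) has a standard binned estimator; a NumPy sketch with an illustrative 10-bin setup (the paper's exact binning is not specified in the abstract):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Binned ECE: weighted average of |accuracy - confidence| over
    # equal-width confidence bins.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    n, ece = len(confidences), 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()    # empirical accuracy in the bin
            conf = confidences[mask].mean()
            ece += mask.sum() / n * abs(acc - conf)
    return ece

# toy example: overconfident predictions yield a nonzero ECE
conf = np.array([0.9, 0.95, 0.85, 0.6, 0.99])
correct = np.array([1, 0, 1, 1, 0], dtype=float)
print(expected_calibration_error(conf, correct))
```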
Evaluated on five benchmark datasets, we find that ITUFL-HN generalizes better than current FL and PFL methods, especially when the novel client has a large domain shift from training clients. We also analyzed the generalization error for novel clients, and showed analytically and experimentally how they can apply differential privacy to their data.", "sentences": ["Inference-Time Unlabeled Personalized Federated Learning.", "In Federated learning (FL), multiple clients collaborate to learn a shared model through a central server while they keep data decentralized.", "Personalized federated learning (PFL) further extends FL by learning personalized models per client.", "In both FL and PFL, all clients participate in the training process and their labeled data is used for training.", "However, in reality, novel clients may wish to join a prediction service after it has been deployed, obtaining predictions for their own unlabeled data.", "Here, we introduce a new learning setup, Inference-Time Unlabeled PFL (ITU-PFL), where a system trained on a set of clients needs to be later applied to novel unlabeled clients at inference time.", "We propose a novel approach to this problem, ITUFL-HN, which uses a hypernetwork to produce a new model for the late-to-the-party client.", "Specifically, we train an encoder network that learns a representation for a client given its unlabeled data.", "That client representation is fed to a hypernetwork that generates a personalized model for that client.", "Evaluated on five benchmark datasets, we find that ITUFL-HN generalizes better than current FL and PFL methods, especially when the novel client has a large domain shift from training clients.", "We also analyzed the generalization error for novel clients, and showed analytically and experimentally how they can apply differential privacy to their data."]} {"id": "http://arxiv.org/abs/2111.15379", "title": "Text classification problems via BERT embedding method and graph convolutional neural network.", "authors": "Loc Hoang Tran, Tuan Tran, An Mai", "abstract": "This paper presents a novel way of combining the BERT embedding method and the graph convolutional neural network. This combination is employed to solve the text classification problem. Initially, we apply the BERT embedding method to the texts (in the BBC news dataset and the IMDB movie reviews dataset) in order to transform all the texts into numerical vectors. Then, the graph convolutional neural network will be applied to these numerical vectors to classify these texts into their appropriate classes/labels. 
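The encoder-plus-hypernetwork pattern of the ITUFL-HN entry above (2111.08356) can be sketched in a few lines of PyTorch. This is an illustrative stand-in, not the paper's architecture: the encoder is a single linear layer with mean pooling, and the personalized model is one linear layer whose weights the hypernetwork emits.

```python
import torch
import torch.nn as nn

class HyperNet(nn.Module):
    # Maps a client embedding to the weights of a small personalized
    # model (a single linear layer here).
    def __init__(self, emb_dim, d_in, d_out):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        self.gen = nn.Linear(emb_dim, d_in * d_out + d_out)

    def forward(self, client_emb, x):
        params = self.gen(client_emb)              # flat parameter vector
        W = params[: self.d_in * self.d_out].view(self.d_out, self.d_in)
        b = params[self.d_in * self.d_out:]
        return x @ W.t() + b                       # personalized predictions

# the client embedding comes from an encoder over the client's unlabeled
# data; a mean-pooled linear encoder serves as a stand-in here
encoder = nn.Linear(32, 16)
client_data = torch.randn(100, 32)                 # unlabeled novel client
client_emb = encoder(client_data).mean(0)          # permutation-invariant summary
model = HyperNet(emb_dim=16, d_in=32, d_out=5)
logits = model(client_emb, client_data)
```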
Experiments show that the performance of the graph convolutional neural network model is better than the performances of the combination of the BERT embedding method with classical machine learning models.", "sentences": ["Text classification problems via BERT embedding method and graph convolutional neural network.", "This paper presents a novel way of combining the BERT embedding method and the graph convolutional neural network.", "This combination is employed to solve the text classification problem.", "Initially, we apply the BERT embedding method to the texts (in the BBC news dataset and the IMDB movie reviews dataset) in order to transform all the texts into numerical vectors.", "Then, the graph convolutional neural network will be applied to these numerical vectors to classify these texts into their appropriate classes/labels.", "Experiments show that the performance of the graph convolutional neural network model is better than the performances of the combination of the BERT embedding method with classical machine learning models."]} {"id": "http://arxiv.org/abs/2112.02043", "title": "Multilingual training for Software Engineering.", "authors": "Toufique Ahmed, Premkumar Devanbu", "abstract": "Well-trained machine-learning models, which leverage large amounts of open-source software data, have now become an interesting approach to automating many software engineering tasks. Several SE tasks have all been subject to this approach, with performance gradually improving over the past several years with better models and training methods. More, and more diverse, clean, labeled data is better for training; but constructing good-quality datasets is time-consuming and challenging. Ways of augmenting the volume and diversity of clean, labeled data generally have wide applicability. For some languages (e.g., Ruby) labeled data is less abundant; in others (e.g., JavaScript) the available data may be more focused on some application domains, and thus less diverse. As a way around such data bottlenecks, we present evidence suggesting that human-written code in different languages (which performs the same function), is rather similar, and particularly preserving of identifier naming patterns; we further present evidence suggesting that identifiers are a very important element of training data for software engineering tasks. We leverage this rather fortuitous phenomenon to find evidence that available multilingual training data (across different languages) can be used to amplify performance. We study this for 3 different tasks: code summarization, code retrieval, and function naming. 
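A minimal sketch of the BERT-embedding-plus-GCN pipeline from the text-classification entry above (2111.15379), assuming mean-pooled bert-base-uncased embeddings, a cosine-similarity graph with an illustrative 0.5 threshold, and a single GCN layer; the paper's exact graph construction and depth are not given in the abstract.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# 1) BERT embeddings: mean-pool the last hidden states of each text
#    (padding tokens are included in the mean for brevity)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
texts = ["stock markets rally", "new film premieres", "team wins the final"]
batch = tok(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    H = bert(**batch).last_hidden_state.mean(dim=1)      # (N, 768)

# 2) cosine-similarity graph over documents; cos(x, x) = 1, so the
#    thresholded adjacency already contains self-loops
A = (F.cosine_similarity(H[:, None], H[None, :], dim=-1) > 0.5).float()
d_inv_sqrt = A.sum(dim=1).pow(-0.5)
A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]    # D^-1/2 A D^-1/2

# 3) one GCN layer mapping propagated features to class logits
W = torch.nn.Linear(768, 3)
logits = A_hat @ W(H)    # train with cross-entropy on labeled nodes
```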
We note that this data-augmenting approach is broadly compatible with different tasks, languages, and machine-learning models.", "sentences": ["Multilingual training for Software Engineering.", "Well-trained machine-learning models, which leverage large amounts of open-source software data, have now become an interesting approach to automating many software engineering tasks.", "Several SE tasks have all been subject to this approach, with performance gradually improving over the past several years with better models and training methods.", "More, and more diverse, clean, labeled data is better for training; but constructing good-quality datasets is time-consuming and challenging.", "Ways of augmenting the volume and diversity of clean, labeled data generally have wide applicability.", "For some languages (e.g., Ruby) labeled data is less abundant; in others (e.g., JavaScript) the available data may be more focused on some application domains, and thus less diverse.", "As a way around such data bottlenecks, we present evidence suggesting that human-written code in different languages (which performs the same function), is rather similar, and particularly preserving of identifier naming patterns; we further present evidence suggesting that identifiers are a very important element of training data for software engineering tasks.", "We leverage this rather fortuitous phenomenon to find evidence that available multilingual training data (across different languages) can be used to amplify performance.", "We study this for 3 different tasks: code summarization, code retrieval, and function naming.", "We note that this data-augmenting approach is broadly compatible with different tasks, languages, and machine-learning models."]} {"id": "http://arxiv.org/abs/2112.03638", "title": "Scaling Structured Inference with Randomization.", "authors": "Yao Fu, John P. Cunningham, Mirella Lapata", "abstract": "Deep discrete structured models have seen considerable progress recently, but traditional inference using dynamic programming (DP) typically works with a small number of states (less than hundreds), which severely limits model capacity. At the same time, across machine learning, there is a recent trend of using randomized truncation techniques to accelerate computations involving large sums. Here, we propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states. Our method is widely applicable to classical DP-based inference (partition, marginal, reparameterization, entropy) and different graph structures (chains, trees, and more general hypergraphs). It is also compatible with automatic differentiation: it can be integrated with neural networks seamlessly and learned with gradient-based optimizers. Our core technique approximates the sum-product by restricting and reweighting DP on a small subset of nodes, which reduces computation by orders of magnitude. We further achieve low bias and variance via Rao-Blackwellization and importance sampling. Experiments over different graphs demonstrate the accuracy and efficiency of our approach. 
Furthermore, when using RDP for training a structured variational autoencoder with a scaled inference network, we achieve better test likelihood than baselines and successfully prevent posterior collapse.", "sentences": ["Scaling Structured Inference with Randomization.", "Deep discrete structured models have seen considerable progress recently, but traditional inference using dynamic programming (DP) typically works with a small number of states (less than hundreds), which severely limits model capacity.", "At the same time, across machine learning, there is a recent trend of using randomized truncation techniques to accelerate computations involving large sums.", "Here, we propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.", "Our method is widely applicable to classical DP-based inference (partition, marginal, reparameterization, entropy) and different graph structures (chains, trees, and more general hypergraphs).", "It is also compatible with automatic differentiation: it can be integrated with neural networks seamlessly and learned with gradient-based optimizers.", "Our core technique approximates the sum-product by restricting and reweighting DP on a small subset of nodes, which reduces computation by orders of magnitude.", "We further achieve low bias and variance via Rao-Blackwellization and importance sampling.", "Experiments over different graphs demonstrate the accuracy and efficiency of our approach.", "Furthermore, when using RDP for training a structured variational autoencoder with a scaled inference network, we achieve better test likelihood than baselines and successfully prevent posterior collapse."]} {"id": "http://arxiv.org/abs/2112.05677", "title": "Concept Representation Learning with Contrastive Self-Supervised Learning.", "authors": "Daniel T. Chang", "abstract": "Concept-oriented deep learning (CODL) is a general approach to meet the future challenges for deep learning: (1) learning with little or no external supervision, (2) coping with test examples that come from a different distribution than the training examples, and (3) integrating deep learning with symbolic AI. In CODL, as in human learning, concept representations are learned based on concept exemplars. Contrastive self-supervised learning (CSSL) provides a promising approach to do so, since it: (1) uses data-driven associations, to get away from semantic labels, (2) supports incremental and continual learning, to get away from (large) fixed datasets, and (3) accommodates emergent objectives, to get away from fixed objectives (tasks). We discuss major aspects of concept representation learning using CSSL. These include dual-level concept representations, CSSL for feature representations, exemplar similarity measures and self-supervised relational reasoning, incremental and continual CSSL, and contrastive self-supervised concept (class) incremental learning. 
The discussion leverages recent findings from cognitive neural science and CSSL.", "sentences": ["Concept Representation Learning with Contrastive Self-Supervised Learning.", "Concept-oriented deep learning (CODL) is a general approach to meet the future challenges for deep learning: (1) learning with little or no external supervision, (2) coping with test examples that come from a different distribution than the training examples, and (3) integrating deep learning with symbolic AI.", "In CODL, as in human learning, concept representations are learned based on concept exemplars.", "Contrastive self-supervised learning (CSSL) provides a promising approach to do so, since it: (1) uses data-driven associations, to get away from semantic labels, (2) supports incremental and continual learning, to get away from (large) fixed datasets, and (3) accommodates emergent objectives, to get away from fixed objectives (tasks).", "We discuss major aspects of concept representation learning using CSSL.", "These include dual-level concept representations, CSSL for feature representations, exemplar similarity measures and self-supervised relational reasoning, incremental and continual CSSL, and contrastive self-supervised concept (class) incremental learning.", "The discussion leverages recent findings from cognitive neural science and CSSL."]} {"id": "http://arxiv.org/abs/2112.05911", "title": "Learning Contraction Policies from Offline Data.", "authors": "Navid Rezazadeh, Maxwell Kolarich, Solmaz S. Kia, Negar Mehr", "abstract": "This paper proposes a data-driven method for learning convergent control policies from offline data using Contraction theory. Contraction theory enables constructing a policy that makes the closed-loop system trajectories inherently convergent towards a unique trajectory. At the technical level, identifying the contraction metric, which is the distance metric with respect to which a robot's trajectories exhibit contraction, is often non-trivial. We propose to jointly learn the control policy and its corresponding contraction metric while enforcing contraction. To achieve this, we learn an implicit dynamics model of the robotic system from an offline data set consisting of the robot's state and input trajectories. Using this learned dynamics model, we propose a data augmentation algorithm for learning contraction policies. We randomly generate samples in the state-space and propagate them forward in time through the learned dynamics model to generate auxiliary sample trajectories. We then learn both the control policy and the contraction metric such that the distance between the trajectories from the offline data set and our generated auxiliary sample trajectories decreases over time. 
We evaluate the performance of our proposed framework on simulated robotic goal-reaching tasks and demonstrate that enforcing contraction results in faster convergence and greater robustness of the learned policy.", "sentences": ["Learning Contraction Policies from Offline Data.", "This paper proposes a data-driven method for learning convergent control policies from offline data using Contraction theory.", "Contraction theory enables constructing a policy that makes the closed-loop system trajectories inherently convergent towards a unique trajectory.", "At the technical level, identifying the contraction metric, which is the distance metric with respect to which a robot's trajectories exhibit contraction, is often non-trivial.", "We propose to jointly learn the control policy and its corresponding contraction metric while enforcing contraction.", "To achieve this, we learn an implicit dynamics model of the robotic system from an offline data set consisting of the robot's state and input trajectories.", "Using this learned dynamics model, we propose a data augmentation algorithm for learning contraction policies.", "We randomly generate samples in the state-space and propagate them forward in time through the learned dynamics model to generate auxiliary sample trajectories.", "We then learn both the control policy and the contraction metric such that the distance between the trajectories from the offline data set and our generated auxiliary sample trajectories decreases over time.", "We evaluate the performance of our proposed framework on simulated robotic goal-reaching tasks and demonstrate that enforcing contraction results in faster convergence and greater robustness of the learned policy."]} {"id": "http://arxiv.org/abs/2112.09071", "title": "A Deep Learning Based Multitask Network for Respiration Rate Estimation -- A Practical Perspective.", "authors": "Kapil Singh Rathore, Sricharan Vijayarangan, Preejith SP, Mohanasankar Sivaprakasam", "abstract": "The exponential rise in wearable sensors has garnered significant interest in assessing the physiological parameters during day-to-day activities. Respiration rate is one of the vital parameters used in the performance assessment of lifestyle activities. However, obtrusive setup for measurement, motion artifacts, and other noises complicate the process. This paper presents a multitasking architecture based on Deep Learning (DL) for estimating instantaneous and average respiration rate from ECG and accelerometer signals, such that it performs efficiently under daily living activities like cycling, walking, etc. The multitasking network consists of a combination of Encoder-Decoder and Encoder-IncResNet, to fetch the average respiration rate and the respiration signal. The respiration signal can be leveraged to obtain the breathing peaks and instantaneous breathing cycles. Mean absolute error (MAE), Root mean square error (RMSE), inference time, and parameter count analyses have been used to compare the network with the current state-of-the-art Machine Learning (ML) model and other DL models developed in previous studies. Other DL configurations based on a variety of inputs are also developed as a part of the work. 
The proposed model showed better overall accuracy and gave better results than individual modalities during different activities.", "sentences": ["A Deep Learning Based Multitask Network for Respiration Rate Estimation -- A Practical Perspective.", "The exponential rise in wearable sensors has garnered significant interest in assessing the physiological parameters during day-to-day activities.", "Respiration rate is one of the vital parameters used in the performance assessment of lifestyle activities.", "However, obtrusive setup for measurement, motion artifacts, and other noises complicate the process.", "This paper presents a multitasking architecture based on Deep Learning (DL) for estimating instantaneous and average respiration rate from ECG and accelerometer signals, such that it performs efficiently under daily living activities like cycling, walking, etc.", "The multitasking network consists of a combination of Encoder-Decoder and Encoder-IncResNet, to fetch the average respiration rate and the respiration signal.", "The respiration signal can be leveraged to obtain the breathing peaks and instantaneous breathing cycles.", "Mean absolute error (MAE), Root mean square error (RMSE), inference time, and parameter count analyses have been used to compare the network with the current state-of-the-art Machine Learning (ML) model and other DL models developed in previous studies.", "Other DL configurations based on a variety of inputs are also developed as a part of the work.", "The proposed model showed better overall accuracy and gave better results than individual modalities during different activities."]} {"id": "http://arxiv.org/abs/2112.12376", "title": "Revisiting and Advancing Fast Adversarial Training Through The Lens of Bi-Level Optimization.", "authors": "Yihua Zhang, Guanhua Zhang, Prashant Khanduri, Mingyi Hong, Shiyu Chang, Sijia Liu", "abstract": "Adversarial training (AT) is a widely recognized defense mechanism to gain the robustness of deep neural networks against adversarial attacks. It is built on min-max optimization (MMO), where the minimizer (i.e., defender) seeks a robust model to minimize the worst-case training loss in the presence of adversarial examples crafted by the maximizer (i.e., attacker). However, the conventional MMO method makes AT hard to scale. Thus, Fast-AT and other recent algorithms attempt to simplify MMO by replacing its maximization step with the single gradient sign-based attack generation step. Although easy to implement, Fast-AT lacks theoretical guarantees, and its empirical performance is unsatisfactory due to the issue of robust catastrophic overfitting when training with strong adversaries. In this paper, we advance Fast-AT from the fresh perspective of bi-level optimization (BLO). We first show that the commonly-used Fast-AT is equivalent to using a stochastic gradient algorithm to solve a linearized BLO problem involving a sign operation. However, the discrete nature of the sign operation makes it difficult to understand the algorithm performance. Inspired by BLO, we design and analyze a new set of robust training algorithms termed Fast Bi-level AT (Fast-BAT), which effectively defends sign-based projected gradient descent (PGD) attacks without using any gradient sign method or explicit robust regularization. In practice, we show that our method yields substantial robustness improvements over multiple baselines across multiple models and datasets. 
All code for reproducing the experiments in this paper is at https://github.com/NormalUhr/Fast_BAT.", "sentences": ["Revisiting and Advancing Fast Adversarial Training Through The Lens of Bi-Level Optimization.", "Adversarial training (AT) is a widely recognized defense mechanism to gain the robustness of deep neural networks against adversarial attacks.", "It is built on min-max optimization (MMO), where the minimizer (i.e., defender) seeks a robust model to minimize the worst-case training loss in the presence of adversarial examples crafted by the maximizer (i.e., attacker).", "However, the conventional MMO method makes AT hard to scale.", "Thus, Fast-AT and other recent algorithms attempt to simplify MMO by replacing its maximization step with the single gradient sign-based attack generation step.", "Although easy to implement, Fast-AT lacks theoretical guarantees, and its empirical performance is unsatisfactory due to the issue of robust catastrophic overfitting when training with strong adversaries.", "In this paper, we advance Fast-AT from the fresh perspective of bi-level optimization (BLO).", "We first show that the commonly-used Fast-AT is equivalent to using a stochastic gradient algorithm to solve a linearized BLO problem involving a sign operation.", "However, the discrete nature of the sign operation makes it difficult to understand the algorithm performance.", "Inspired by BLO, we design and analyze a new set of robust training algorithms termed Fast Bi-level AT (Fast-BAT), which effectively defends sign-based projected gradient descent (PGD) attacks without using any gradient sign method or explicit robust regularization.", "In practice, we show that our method yields substantial robustness improvements over multiple baselines across multiple models and datasets.", "All code for reproducing the experiments in this paper is at https://github.com/NormalUhr/Fast_BAT."]} {"id": "http://arxiv.org/abs/2201.08712", "title": "Improved Random Features for Dot Product Kernels.", "authors": "Jonas Wacker, Motonobu Kanagawa, Maurizio Filippone", "abstract": "Dot product kernels, such as polynomial and exponential (softmax) kernels, are among the most widely used kernels in machine learning, as they enable modeling the interactions between input features, which is crucial in applications like computer vision, natural language processing, and recommender systems. We make several novel contributions for improving the efficiency of random feature approximations for dot product kernels, to make these kernels more useful in large scale learning. First, we present a generalization of existing random feature approximations for polynomial kernels, such as Rademacher and Gaussian sketches and TensorSRHT, using complex-valued random features. We show empirically that the use of complex features can significantly reduce the variances of these approximations. Second, we provide a theoretical analysis for understanding the factors affecting the efficiency of various random feature approximations, by deriving closed-form expressions for their variances. These variance formulas elucidate conditions under which certain approximations (e.g., TensorSRHT) achieve lower variances than others (e.g., Rademacher sketches), and conditions under which the use of complex features leads to lower variances than real features. 
Third, by using these variance formulas, which can be evaluated in practice, we develop a data-driven optimization approach to improve random feature approximations for general dot product kernels, which is also applicable to the Gaussian kernel. We describe the improvements brought by these contributions with extensive experiments on a variety of tasks and datasets.", "sentences": ["Improved Random Features for Dot Product Kernels.", "Dot product kernels, such as polynomial and exponential (softmax) kernels, are among the most widely used kernels in machine learning, as they enable modeling the interactions between input features, which is crucial in applications like computer vision, natural language processing, and recommender systems.", "We make several novel contributions for improving the efficiency of random feature approximations for dot product kernels, to make these kernels more useful in large scale learning.", "First, we present a generalization of existing random feature approximations for polynomial kernels, such as Rademacher and Gaussian sketches and TensorSRHT, using complex-valued random features.", "We show empirically that the use of complex features can significantly reduce the variances of these approximations.", "Second, we provide a theoretical analysis for understanding the factors affecting the efficiency of various random feature approximations, by deriving closed-form expressions for their variances.", "These variance formulas elucidate conditions under which certain approximations (e.g., TensorSRHT) achieve lower variances than others (e.g., Rademacher sketches), and conditions under which the use of complex features leads to lower variances than real features.", "Third, by using these variance formulas, which can be evaluated in practice, we develop a data-driven optimization approach to improve random feature approximations for general dot product kernels, which is also applicable to the Gaussian kernel.", "We describe the improvements brought by these contributions with extensive experiments on a variety of tasks and datasets."]} {"id": "http://arxiv.org/abs/2201.10328", "title": "ML4CO-KIDA: Knowledge Inheritance in Dataset Aggregation.", "authors": "Zixuan Cao, Yang Xu, Zhewei Huang, Shuchang Zhou", "abstract": "The Machine Learning for Combinatorial Optimization (ML4CO) NeurIPS 2021 competition aims to improve state-of-the-art combinatorial optimization solvers by replacing key heuristic components with machine learning models. On the dual task, we design models to make branching decisions that increase the dual bound faster. We propose a knowledge inheritance method to generalize knowledge of different models from the dataset aggregation process, named KIDA. Our improvement overcomes some defects of the baseline graph-neural-networks-based methods. Further, we won the $1$\textsuperscript{st} Place on the dual task. We hope this report can provide useful experience for developers and researchers. 
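The real-valued Rademacher sketch that the random-features entry above (2201.08712) generalizes with complex features can be written down directly: each random feature is a product of p independent Rademacher projections, which is unbiased for the polynomial kernel <x, y>^p. A NumPy sketch with illustrative sizes:

```python
import numpy as np

def rademacher_sketch(X, D=4096, p=2, seed=1):
    # Each of the D features is a product of p independent Rademacher
    # projections; E[phi(x)^T phi(y)] = <x, y>^p, the polynomial kernel.
    rng = np.random.default_rng(seed)
    Phi = np.ones((X.shape[0], D))
    for _ in range(p):
        W = rng.choice([-1.0, 1.0], size=(X.shape[1], D))
        Phi *= X @ W
    return Phi / np.sqrt(D)

X = np.random.default_rng(0).standard_normal((50, 20))
Phi = rademacher_sketch(X)                 # fixed seed -> reusable features
K_exact, K_approx = (X @ X.T) ** 2, Phi @ Phi.T
# relative Frobenius error shrinks as D grows
print(np.linalg.norm(K_approx - K_exact) / np.linalg.norm(K_exact))
```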
The code is available at https://github.com/megvii-research/NeurIPS2021-ML4CO-KIDA.", "sentences": ["ML4CO-KIDA: Knowledge Inheritance in Dataset Aggregation.", "The Machine Learning for Combinatorial Optimization (ML4CO) NeurIPS 2021 competition aims to improve state-of-the-art combinatorial optimization solvers by replacing key heuristic components with machine learning models.", "On the dual task, we design models to make branching decisions that increase the dual bound faster.", "We propose a knowledge inheritance method to generalize knowledge of different models from the dataset aggregation process, named KIDA.", "Our improvement overcomes some defects of the baseline graph-neural-networks-based methods.", "Further, we won the $1$\textsuperscript{st} Place on the dual task.", "We hope this report can provide useful experience for developers and researchers.", "The code is available at https://github.com/megvii-research/NeurIPS2021-ML4CO-KIDA."]} {"id": "http://arxiv.org/abs/2201.12674", "title": "Rewiring with Positional Encodings for Graph Neural Networks.", "authors": "Rickard Br\u00fcel-Gabrielsson, Mikhail Yurochkin, Justin Solomon", "abstract": "Several recent works use positional encodings to extend the receptive fields of graph neural network (GNN) layers equipped with attention mechanisms. These techniques, however, extend receptive fields to the complete graph, at substantial computational cost and risking a change in the inductive biases of conventional GNNs, or require complex architecture adjustments. As a conservative alternative, we use positional encodings to expand receptive fields to any r-ring. Our method augments the input graph with additional nodes/edges and uses positional encodings as node and/or edge features. Thus, it is compatible with many existing GNN architectures. We also provide examples of positional encodings that are non-invasive, i.e., there is a one-to-one map between the original and the modified graphs. Our experiments demonstrate that extending receptive fields via positional encodings and a virtual fully-connected node significantly improves GNN performance and alleviates over-squashing using small r. We obtain improvements across models, showing state-of-the-art performance even using older architectures than recent Transformer models adapted to graphs.", "sentences": ["Rewiring with Positional Encodings for Graph Neural Networks.", "Several recent works use positional encodings to extend the receptive fields of graph neural network (GNN) layers equipped with attention mechanisms.", "These techniques, however, extend receptive fields to the complete graph, at substantial computational cost and risking a change in the inductive biases of conventional GNNs, or require complex architecture adjustments.", "As a conservative alternative, we use positional encodings to expand receptive fields to any r-ring.", "Our method augments the input graph with additional nodes/edges and uses positional encodings as node and/or edge features.", "Thus, it is compatible with many existing GNN architectures.", "We also provide examples of positional encodings that are non-invasive, i.e., there is a one-to-one map between the original and the modified graphs.", "Our experiments demonstrate that extending receptive fields via positional encodings and a virtual fully-connected node significantly improves GNN performance and alleviates over-squashing using small r. 
We obtain improvements across models, showing state-of-the-art performance even using older architectures than recent Transformer models adapted to graphs."]} {"id": "http://arxiv.org/abs/2201.12733", "title": "TPC: Transformation-Specific Smoothing for Point Cloud Models.", "authors": "Wenda Chu, Linyi Li, Bo Li", "abstract": "Point cloud models with neural network architectures have achieved great success and have been widely used in safety-critical applications, such as Lidar-based recognition systems in autonomous vehicles. However, such models have been shown to be vulnerable to adversarial attacks, which aim to apply stealthy semantic transformations such as rotation and tapering to mislead model predictions. In this paper, we propose a transformation-specific smoothing framework TPC, which provides tight and scalable robustness guarantees for point cloud models against semantic transformation attacks. We first categorize common 3D transformations into three categories: additive (e.g., shearing), composable (e.g., rotation), and indirectly composable (e.g., tapering), and we present generic robustness certification strategies for all categories respectively. We then specify unique certification protocols for a range of specific semantic transformations and their compositions. Extensive experiments on several common 3D transformations show that TPC significantly outperforms the state of the art. For example, our framework boosts the certified accuracy against twisting transformation along z-axis (within 20$^\circ$) from 20.3$\%$ to 83.8$\%$.", "sentences": ["TPC: Transformation-Specific Smoothing for Point Cloud Models.", "Point cloud models with neural network architectures have achieved great success and have been widely used in safety-critical applications, such as Lidar-based recognition systems in autonomous vehicles.", "However, such models have been shown to be vulnerable to adversarial attacks, which aim to apply stealthy semantic transformations such as rotation and tapering to mislead model predictions.", "In this paper, we propose a transformation-specific smoothing framework TPC, which provides tight and scalable robustness guarantees for point cloud models against semantic transformation attacks.", "We first categorize common 3D transformations into three categories: additive (e.g., shearing), composable (e.g., rotation), and indirectly composable (e.g., tapering), and we present generic robustness certification strategies for all categories respectively.", "We then specify unique certification protocols for a range of specific semantic transformations and their compositions.", "Extensive experiments on several common 3D transformations show that TPC significantly outperforms the state of the art.", "For example, our framework boosts the certified accuracy against twisting transformation along z-axis (within 20$^\circ$) from 20.3$\%$ to 83.8$\%$."]} {"id": "http://arxiv.org/abs/2201.13195", "title": "Memory-Efficient Backpropagation through Large Linear Layers.", "authors": "Daniel Bershatsky, Aleksandr Mikhalev, Alexandr Katrutsa, Julia Gusak, Daniil Merkulov, Ivan Oseledets", "abstract": "In modern neural networks like Transformers, linear layers require significant memory to store activations during the backward pass. This study proposes a memory reduction approach to perform backpropagation through linear layers. 
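The r-ring rewiring described in the Rewiring with Positional Encodings entry above (2201.12674) reduces to a short graph routine: connect every pair of nodes within hop distance r and record the distance as an edge positional encoding. A NetworkX sketch of this augmentation; the paper's virtual fully-connected node and its exact encodings are omitted.

```python
import networkx as nx

def rewire_with_rings(G, r=2):
    # Augment G with an edge to every node within hop distance r; the
    # hop distance is stored as an edge feature 'pe', so the original
    # 1-ring stays distinguishable from the added 2..r rings.
    H = G.copy()
    lengths = dict(nx.all_pairs_shortest_path_length(G, cutoff=r))
    for u, dists in lengths.items():
        for v, d in dists.items():
            if u < v and d >= 1:
                H.add_edge(u, v, pe=d)
    return H

G = nx.cycle_graph(6)          # 6 edges
H = rewire_with_rings(G, r=2)  # 12 edges: 1-ring plus 2-ring
print(H.number_of_edges(), [H.edges[e]["pe"] for e in H.edges])
```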
Since the gradients of linear layers are computed by matrix multiplications, we consider methods for randomized matrix multiplications and demonstrate that they require less memory with a moderate decrease of the test accuracy. Also, we investigate the variance of the gradient estimate induced by the randomized matrix multiplication. We compare this variance with the variance coming from gradient estimation based on the batch of samples. We demonstrate the benefits of the proposed method on the fine-tuning of the pre-trained RoBERTa model on GLUE tasks.", "sentences": ["Memory-Efficient Backpropagation through Large Linear Layers.", "In modern neural networks like Transformers, linear layers require significant memory to store activations during the backward pass.", "This study proposes a memory reduction approach to perform backpropagation through linear layers.", "Since the gradients of linear layers are computed by matrix multiplications, we consider methods for randomized matrix multiplications and demonstrate that they require less memory with a moderate decrease of the test accuracy.", "Also, we investigate the variance of the gradient estimate induced by the randomized matrix multiplication.", "We compare this variance with the variance coming from gradient estimation based on the batch of samples.", "We demonstrate the benefits of the proposed method on the fine-tuning of the pre-trained RoBERTa model on GLUE tasks."]} {"id": "http://arxiv.org/abs/2202.00063", "title": "Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach.", "authors": "Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, Wen Sun", "abstract": "We present BRIEE (Block-structured Representation learning with Interleaved Explore Exploit), an algorithm for efficient reinforcement learning in Markov Decision Processes with block-structured dynamics (i.e., Block MDPs), where rich observations are generated from a set of unknown latent states. BRIEE interleaves latent states discovery, exploration, and exploitation together, and can provably learn a near-optimal policy with sample complexity scaling polynomially in the number of latent states, actions, and the time horizon, with no dependence on the size of the potentially infinite observation space. 
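The randomized-matrix-multiplication idea in the memory-efficient backpropagation entry above (2201.13195) can be illustrated with the simplest variant, uniform row sampling: only a random subset of the batch rows of the input is kept from the forward pass, and the weight gradient X^T dY is estimated from that subset without bias. A NumPy sketch; the paper studies more refined sampling schemes.

```python
import numpy as np

rng = np.random.default_rng(2)

def randomized_grad_w(X, dY, k):
    # Unbiased estimate of the weight gradient X^T dY of a linear layer
    # from k uniformly sampled batch rows: each row is kept with
    # probability k/n and rescaled by n/k, so the expectation is exact.
    # Only the sampled rows of X need to be stored from the forward pass.
    n = X.shape[0]
    idx = rng.choice(n, size=k, replace=False)
    return (n / k) * X[idx].T @ dY[idx]

n, d_in, d_out, k = 1024, 256, 128, 128
X, dY = rng.standard_normal((n, d_in)), rng.standard_normal((n, d_out))
exact = X.T @ dY
approx = randomized_grad_w(X, dY, k)   # ~8x less activation memory
print(np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```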
Empirically, we show that BRIEE is more sample efficient than the state-of-the-art Block MDP algorithm HOMER and other empirical RL baselines on challenging rich-observation combination lock problems that require deep exploration.", "sentences": ["Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach.", "We present BRIEE (Block-structured Representation learning with Interleaved Explore Exploit), an algorithm for efficient reinforcement learning in Markov Decision Processes with block-structured dynamics (i.e., Block MDPs), where rich observations are generated from a set of unknown latent states.", "BRIEE interleaves latent states discovery, exploration, and exploitation together, and can provably learn a near-optimal policy with sample complexity scaling polynomially in the number of latent states, actions, and the time horizon, with no dependence on the size of the potentially infinite observation space.", "Empirically, we show that BRIEE is more sample efficient than the state-of-the-art Block MDP algorithm HOMER and other empirical RL baselines on challenging rich-observation combination lock problems that require deep exploration."]} {"id": "http://arxiv.org/abs/2202.00441", "title": "Few-Bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction.", "authors": "Georgii Novikov, Daniel Bershatsky, Julia Gusak, Alex Shonenkov, Denis Dimitrov, Ivan Oseledets", "abstract": "Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and each such operation induces additional memory costs which -- as we show -- can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per element. We show that such approximation can be achieved by computing optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. 
We confirm the memory reduction and unchanged convergence on several open benchmarks.", "sentences": ["Few-Bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction.", "Memory footprint is one of the main limiting factors for large neural network training.", "In backpropagation, one needs to store the input to each operation in the computational graph.", "Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and each such operation induces additional memory costs which -- as we show -- can be significantly reduced by quantization of the gradients.", "We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per element.", "We show that such approximation can be achieved by computing optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming.", "The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline.", "We confirm the memory reduction and unchanged convergence on several open benchmarks."]} {"id": "http://arxiv.org/abs/2202.00450", "title": "Approximation of Images via Generalized Higher Order Singular Value Decomposition over Finite-dimensional Commutative Semisimple Algebra.", "authors": "Liang Liao, Sen Lin, Lun Li, Xiuwei Zhang, Song Zhao, Yan Wang, Xinqiang Wang, Qi Gao, Jingyu Wang", "abstract": "Low-rank approximation of images via singular value decomposition is well-received in the era of big data. However, singular value decomposition (SVD) is only for order-two data, i.e., matrices. It is necessary to flatten a higher order input into a matrix or break it into a series of order-two slices to tackle higher order data such as multispectral images and videos with the SVD. Higher order singular value decomposition (HOSVD) extends the SVD and can approximate higher order data using sums of a few rank-one components. We consider the problem of generalizing HOSVD over a finite-dimensional commutative algebra. This algebra, referred to as a t-algebra, generalizes the field of complex numbers. The elements of the algebra, called t-scalars, are fixed-size arrays of complex numbers. One can generalize matrices and tensors over t-scalars and then extend many canonical matrix and tensor algorithms, including HOSVD, to obtain higher-performance versions. The generalization of HOSVD is called THOSVD. Its performance of approximating multi-way data can be further improved by an alternating algorithm. THOSVD also unifies a wide range of principal component analysis algorithms. To exploit the potential of generalized algorithms using t-scalars for approximating images, we use a pixel neighborhood strategy to convert each pixel to a \"deeper-order\" t-scalar. 
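The few-bit backward pass of the entry above (2202.00441) is easy to prototype for a single nonlinearity. The sketch below quantizes the GELU derivative to 16 uniform bins on [-4, 4], i.e. 4 bits per element, whereas the paper computes optimal piecewise-constant approximations via dynamic programming.

```python
import math
import torch

def gelu_grad(x):
    # exact derivative of GELU(x) = x * Phi(x):  Phi(x) + x * phi(x)
    phi = torch.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
    return Phi + x * phi

BOUNDS = torch.linspace(-4.0, 4.0, 15)   # 15 boundaries -> 16 bins (4 bits)
REPS = torch.cat([BOUNDS[:1], (BOUNDS[:-1] + BOUNDS[1:]) / 2, BOUNDS[-1:]])
TABLE = gelu_grad(REPS)                  # one derivative value per bin

class FewBitGELU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # save only a small bin index per element instead of x itself
        ctx.save_for_backward(torch.bucketize(x, BOUNDS).to(torch.uint8))
        return torch.nn.functional.gelu(x)

    @staticmethod
    def backward(ctx, grad_out):
        (bins,) = ctx.saved_tensors
        return grad_out * TABLE[bins.long()]   # table lookup per bin

x = torch.randn(4, requires_grad=True)
FewBitGELU.apply(x).sum().backward()
print(x.grad, gelu_grad(x.detach()))     # close, up to quantization error
```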
Experiments on publicly available images show that the generalized algorithm over t-scalars, namely THOSVD, compares favorably with its canonical counterparts.", "sentences": ["Approximation of Images via Generalized Higher Order Singular Value Decomposition over Finite-dimensional Commutative Semisimple Algebra.", "Low-rank approximation of images via singular value decomposition is well-received in the era of big data.", "However, singular value decomposition (SVD) is only for order-two data, i.e., matrices.", "It is necessary to flatten a higher order input into a matrix or break it into a series of order-two slices to tackle higher order data such as multispectral images and videos with the SVD.", "Higher order singular value decomposition (HOSVD) extends the SVD and can approximate higher order data using sums of a few rank-one components.", "We consider the problem of generalizing HOSVD over a finite-dimensional commutative algebra.", "This algebra, referred to as a t-algebra, generalizes the field of complex numbers.", "The elements of the algebra, called t-scalars, are fixed-size arrays of complex numbers.", "One can generalize matrices and tensors over t-scalars and then extend many canonical matrix and tensor algorithms, including HOSVD, to obtain higher-performance versions.", "The generalization of HOSVD is called THOSVD.", "Its performance of approximating multi-way data can be further improved by an alternating algorithm.", "THOSVD also unifies a wide range of principal component analysis algorithms.", "To exploit the potential of generalized algorithms using t-scalars for approximating images, we use a pixel neighborhood strategy to convert each pixel to a \"deeper-order\" t-scalar.", "Experiments on publicly available images show that the generalized algorithm over t-scalars, namely THOSVD, compares favorably with its canonical counterparts."]} {"id": "http://arxiv.org/abs/2202.00738", "title": "LocUNet: Fast Urban Positioning Using Radio Maps and Deep Learning.", "authors": "\u00c7a\u011fkan Yapar, Ron Levie, Gitta Kutyniok, Giuseppe Caire", "abstract": "This paper deals with the problem of localization in a cellular network in a dense urban scenario. Global Navigation Satellite Systems (GNSS) typically perform poorly in urban environments, where the likelihood of line-of-sight conditions is low, and thus alternative localization methods are required for good accuracy. We present LocUNet: A deep learning method for localization, based merely on Received Signal Strength (RSS) from Base Stations (BSs), which does not require any increase in computation complexity at the user devices with respect to the device standard operations, unlike methods that rely on time of arrival or angle of arrival information. In the proposed method, the user to be localized reports the RSS from BSs to a Central Processing Unit (CPU), which may be located in the cloud. Alternatively, the localization can be performed locally at the user. Using estimated pathloss radio maps of the BSs, LocUNet can localize users with state-of-the-art accuracy and enjoys high robustness to inaccuracies in the radio maps. The proposed method does not require pre-sampling of the environment and is suitable for real-time applications, thanks to the RadioUNet, a neural network-based radio map estimator. 
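For reference, the canonical HOSVD that the THOSVD entry above (2202.00450) generalizes to t-scalars fits in a short NumPy routine: factor matrices come from SVDs of the mode unfoldings, and the core is the tensor contracted with their transposes.

```python
import numpy as np

def unfold(T, mode):
    # mode-n unfolding: move axis `mode` to the front and flatten the rest
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    # truncated HOSVD: U_n = top left singular vectors of each unfolding,
    # core G = T x_1 U_1^T x_2 U_2^T x_3 U_3^T ...
    Us = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        Us.append(U[:, :r])
    G = T
    for mode, U in enumerate(Us):
        G = np.moveaxis(np.tensordot(U.T, G, axes=(1, mode)), 0, mode)
    return G, Us

def reconstruct(G, Us):
    for mode, U in enumerate(Us):
        G = np.moveaxis(np.tensordot(U, G, axes=(1, mode)), 0, mode)
    return G

T = np.random.default_rng(3).standard_normal((20, 20, 3))  # e.g. an RGB patch
G, Us = hosvd(T, ranks=(5, 5, 3))
print(np.linalg.norm(T - reconstruct(G, Us)) / np.linalg.norm(T))
```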
We also introduce two datasets that allow numerical comparisons of RSS and Time of Arrival (ToA) methods in realistic urban environments.", "sentences": ["LocUNet: Fast Urban Positioning Using Radio Maps and Deep Learning.", "This paper deals with the problem of localization in a cellular network in a dense urban scenario.", "Global Navigation Satellite Systems (GNSS) typically perform poorly in urban environments, where the likelihood of line-of-sight conditions is low, and thus alternative localization methods are required for good accuracy.", "We present LocUNet: A deep learning method for localization, based merely on Received Signal Strength (RSS) from Base Stations (BSs), which does not require any increase in computation complexity at the user devices with respect to the device standard operations, unlike methods that rely on time of arrival or angle of arrival information.", "In the proposed method, the user to be localized reports the RSS from BSs to a Central Processing Unit (CPU), which may be located in the cloud.", "Alternatively, the localization can be performed locally at the user.", "Using estimated pathloss radio maps of the BSs, LocUNet can localize users with state-of-the-art accuracy and enjoys high robustness to inaccuracies in the radio maps.", "The proposed method does not require pre-sampling of the environment; and is suitable for real-time applications, thanks to the RadioUNet, a neural network-based radio map estimator.", "We also introduce two datasets that allow numerical comparisons of RSS and Time of Arrival (ToA) methods in realistic urban environments."]} {"id": "http://arxiv.org/abs/2202.00824", "title": "KSD Aggregated Goodness-of-fit Test.", "authors": "Antonin Schrab, Benjamin Guedj, Arthur Gretton", "abstract": "We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide theoretical guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. 
We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art adaptive KSD-based goodness-of-fit testing procedures.", "sentences": ["KSD Aggregated Goodness-of-fit Test.", "We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD).", "We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels.", "KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels.", "We provide theoretical guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term.", "KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections.", "In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting.", "We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art adaptive KSD-based goodness-of-fit testing procedures."]} {"id": "http://arxiv.org/abs/2202.00834", "title": "Algorithms for Efficiently Learning Low-Rank Neural Networks.", "authors": "Kiran Vodrahalli, Rakesh Shivanna, Maheswaran Sathiamoorthy, Sagar Jain, Ed H. Chi", "abstract": "We study algorithms for learning low-rank neural networks -- networks where the weight parameters are re-parameterized by products of two low-rank matrices. First, we present a provably efficient algorithm which learns an optimal low-rank approximation to a single-hidden-layer ReLU network up to additive error $\\epsilon$ with probability $\\ge 1 - \\delta$, given access to noiseless samples with Gaussian marginals in polynomial time and samples. Thus, we provide the first example of an algorithm which can efficiently learn a neural network up to additive error without assuming the ground truth is realizable. To solve this problem, we introduce an efficient SVD-based $\\textit{Nonlinear Kernel Projection}$ algorithm for solving a nonlinear low-rank approximation problem over Gaussian space. Inspired by the efficiency of our algorithm, we propose a novel low-rank initialization framework for training low-rank $\\textit{deep}$ networks, and prove that for ReLU networks, the gap between our method and existing schemes widens as the desired rank of the approximating weights decreases, or as the dimension of the inputs increases (the latter point holds when network width is superlinear in dimension). 
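A toy version of the KSDAgg recipe above, for a one-dimensional standard-normal target: compute the KSD statistic for several kernel bandwidths, calibrate each with a wild bootstrap, and reject if any single test fires at a corrected level. The Gaussian kernel, the bandwidth grid, and the Bonferroni-style correction are simplifications; the paper's aggregation uses a sharper level correction.

```python
# Sketch of an aggregated KSD test for p = N(0,1) (assumed Gaussian kernel).
import numpy as np

def stein_kernel(x, h):
    # Langevin Stein kernel for p = N(0,1), whose score is s(x) = -x.
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * h**2))
    dkx = -d / h**2 * k                      # d k / d x
    dky = d / h**2 * k                       # d k / d y
    dkxy = (1 / h**2 - d**2 / h**4) * k      # d^2 k / dx dy
    s = -x
    return (s[:, None] * s[None, :] * k + s[:, None] * dky
            + s[None, :] * dkx + dkxy)

def ksdagg_reject(x, bandwidths=(0.5, 1.0, 2.0), level=0.05, B=500, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    for h in bandwidths:
        H = stein_kernel(x, h)
        stat = H.sum() / n**2                # V-statistic estimate of KSD^2
        eps = rng.choice([-1.0, 1.0], size=(B, n))   # wild (Rademacher) bootstrap
        boot = np.einsum('bi,ij,bj->b', eps, H, eps) / n**2
        if stat > np.quantile(boot, 1 - level / len(bandwidths)):
            return True                      # some kernel detects the misfit
    return False

rng = np.random.default_rng(1)
print(ksdagg_reject(rng.normal(size=300)))        # typically False: H0 holds
print(ksdagg_reject(rng.normal(size=300) + 0.7))  # typically True: shifted sample
```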
Finally, we validate our theory by training ResNet and EfficientNet models on ImageNet.", "sentences": ["Algorithms for Efficiently Learning Low-Rank Neural Networks.", "We study algorithms for learning low-rank neural networks -- networks where the weight parameters are re-parameterized by products of two low-rank matrices.", "First, we present a provably efficient algorithm which learns an optimal low-rank approximation to a single-hidden-layer ReLU network up to additive error $\\epsilon$ with probability $\\ge 1 - \\delta$, given access to noiseless samples with Gaussian marginals in polynomial time and samples.", "Thus, we provide the first example of an algorithm which can efficiently learn a neural network up to additive error without assuming the ground truth is realizable.", "To solve this problem, we introduce an efficient SVD-based $\\textit{Nonlinear Kernel Projection}$ algorithm for solving a nonlinear low-rank approximation problem over Gaussian space.", "Inspired by the efficiency of our algorithm, we propose a novel low-rank initialization framework for training low-rank $\\textit{deep}$ networks, and prove that for ReLU networks, the gap between our method and existing schemes widens as the desired rank of the approximating weights decreases, or as the dimension of the inputs increases (the latter point holds when network width is superlinear in dimension).", "Finally, we validate our theory by training ResNet and EfficientNet models on ImageNet."]} {"id": "http://arxiv.org/abs/2202.01011", "title": "Auto-Transfer: Learning to Route Transferrable Representations.", "authors": "Keerthiram Murugesan, Vijay Sadashivaiah, Ronny Luss, Karthikeyan Shanmugam, Pin-Yu Chen, Amit Dhurandhar", "abstract": "Knowledge transfer between heterogeneous source and target networks and tasks has received a lot of attention in recent times as large amounts of quality labelled data can be difficult to obtain in many applications. Existing approaches typically constrain the target deep neural network (DNN) feature representations to be close to the source DNNs feature representations, which can be limiting. We, in this paper, propose a novel adversarial multi-armed bandit approach which automatically learns to route source representations to appropriate target representations following which they are combined in meaningful ways to produce accurate target models. We see upwards of 5% accuracy improvements compared with the state-of-the-art knowledge transfer methods on four benchmark (target) image datasets CUB200, Stanford Dogs, MIT67, and Stanford40 where the source dataset is ImageNet. We qualitatively analyze the goodness of our transfer scheme by showing individual examples of the important features our target network focuses on in different layers compared with the (closest) competitors. 
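For the low-rank networks record above, the underlying re-parameterization is simply a weight matrix written as a product of two thin factors; a truncated-SVD factorization (a natural baseline, not the paper's Nonlinear Kernel Projection) looks like this:

```python
# Low-rank re-parameterization W ~= A @ B via truncated SVD (numpy sketch).
import numpy as np

def low_rank_factorize(W, rank):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * np.sqrt(s[:rank])            # (out_dim, rank)
    B = np.sqrt(s[:rank])[:, None] * Vt[:rank]     # (rank, in_dim)
    return A, B

W = np.random.default_rng(0).normal(size=(256, 512))
A, B = low_rank_factorize(W, rank=32)
# 256*512 = 131072 parameters vs (256 + 512)*32 = 24576 after factorization.
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))   # relative approximation error
```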
We also observe that our improvement over other methods is higher for smaller target datasets making it an effective tool for small data applications that may benefit from transfer learning.", "sentences": ["Auto-Transfer: Learning to Route Transferrable Representations.", "Knowledge transfer between heterogeneous source and target networks and tasks has received a lot of attention in recent times as large amounts of quality labelled data can be difficult to obtain in many applications.", "Existing approaches typically constrain the target deep neural network (DNN) feature representations to be close to the source DNNs feature representations, which can be limiting.", "We, in this paper, propose a novel adversarial multi-armed bandit approach which automatically learns to route source representations to appropriate target representations following which they are combined in meaningful ways to produce accurate target models.", "We see upwards of 5% accuracy improvements compared with the state-of-the-art knowledge transfer methods on four benchmark (target) image datasets CUB200, Stanford Dogs, MIT67, and Stanford40 where the source dataset is ImageNet.", "We qualitatively analyze the goodness of our transfer scheme by showing individual examples of the important features our target network focuses on in different layers compared with the (closest) competitors.", "We also observe that our improvement over other methods is higher for smaller target datasets making it an effective tool for small data applications that may benefit from transfer learning."]} {"id": "http://arxiv.org/abs/2202.01197", "title": "VOS: Learning What You Don't Know by Virtual Outlier Synthesis.", "authors": "Xuefeng Du, Zhaoning Wang, Mu Cai, Yixuan Li", "abstract": "Out-of-distribution (OOD) detection has received much attention lately due to its importance in the safe deployment of neural networks. One of the key challenges is that models lack supervision signals from unknown data, and as a result, can produce overconfident predictions on OOD data. Previous approaches rely on real outlier datasets for model regularization, which can be costly and sometimes infeasible to obtain in practice. In this paper, we present VOS, a novel framework for OOD detection by adaptively synthesizing virtual outliers that can meaningfully regularize the model's decision boundary during training. Specifically, VOS samples virtual outliers from the low-likelihood region of the class-conditional distribution estimated in the feature space. Alongside, we introduce a novel unknown-aware training objective, which contrastively shapes the uncertainty space between the ID data and synthesized outlier data. VOS achieves state-of-the-art performance on both object detection and image classification models, reducing the FPR95 by up to 7.87% compared to the previous best method. 
Code is available at https://github.com/deeplearning-wisc/vos.", "sentences": ["VOS: Learning What You Don't Know by Virtual Outlier Synthesis.", "Out-of-distribution (OOD) detection has received much attention lately due to its importance in the safe deployment of neural networks.", "One of the key challenges is that models lack supervision signals from unknown data, and as a result, can produce overconfident predictions on OOD data.", "Previous approaches rely on real outlier datasets for model regularization, which can be costly and sometimes infeasible to obtain in practice.", "In this paper, we present VOS, a novel framework for OOD detection by adaptively synthesizing virtual outliers that can meaningfully regularize the model's decision boundary during training.", "Specifically, VOS samples virtual outliers from the low-likelihood region of the class-conditional distribution estimated in the feature space.", "Alongside, we introduce a novel unknown-aware training objective, which contrastively shapes the uncertainty space between the ID data and synthesized outlier data.", "VOS achieves state-of-the-art performance on both object detection and image classification models, reducing the FPR95 by up to 7.87% compared to the previous best method.", "Code is available at https://github.com/deeplearning-wisc/vos."]} {"id": "http://arxiv.org/abs/2202.00796", "title": "On the Benefits of Selectivity in Pseudo-Labeling for Unsupervised Multi-Source-Free Domain Adaptation.", "authors": "Maohao Shen, Yuheng Bu, Gregory Wornell", "abstract": "Due to privacy, storage, and other constraints, there is a growing need for unsupervised domain adaptation techniques in machine learning that do not require access to the data used to train a collection of source models. Existing methods for such multi-source-free domain adaptation typically train a target model using supervised techniques in conjunction with pseudo-labels for the target data, which are produced by the available source models. However, we show that assigning pseudo-labels to only a subset of the target data leads to improved performance. In particular, we develop an information-theoretic bound on the generalization error of the resulting target model that demonstrates an inherent bias-variance trade-off controlled by the subset choice. Guided by this analysis, we develop a method that partitions the target data into pseudo-labeled and unlabeled subsets to balance the trade-off. In addition to exploiting the pseudo-labeled subset, our algorithm further leverages the information in the unlabeled subset via a traditional unsupervised domain adaptation feature alignment procedure. 
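A rough sketch of the sampling step in the VOS record above: fit a Gaussian to one class's features, oversample from it, and keep only the lowest-likelihood draws as virtual outliers (the covariance jitter and the keep-the-tail rule are assumptions for illustration; the unknown-aware training objective is omitted).

```python
# Hypothetical virtual-outlier synthesis in feature space (numpy sketch).
import numpy as np

def virtual_outliers(feats, n_out=100, oversample=10000, seed=0):
    rng = np.random.default_rng(seed)
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + 1e-4 * np.eye(feats.shape[1])
    draws = rng.multivariate_normal(mu, cov, size=oversample)
    # Log-density up to a constant; smaller means deeper into the tail.
    P = np.linalg.inv(cov)
    d = draws - mu
    logp = -0.5 * np.einsum('ij,jk,ik->i', d, P, d)
    return draws[np.argsort(logp)[:n_out]]

feats = np.random.default_rng(1).normal(size=(500, 8))  # one class's features
print(virtual_outliers(feats).shape)  # (100, 8) low-likelihood virtual outliers
```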
Experiments on multiple benchmark datasets demonstrate the superior performance of the proposed method.", "sentences": ["On the Benefits of Selectivity in Pseudo-Labeling for Unsupervised Multi-Source-Free Domain Adaptation.", "Due to privacy, storage, and other constraints, there is a growing need for unsupervised domain adaptation techniques in machine learning that do not require access to the data used to train a collection of source models.", "Existing methods for such multi-source-free domain adaptation typically train a target model using supervised techniques in conjunction with pseudo-labels for the target data, which are produced by the available source models.", "However, we show that assigning pseudo-labels to only a subset of the target data leads to improved performance.", "In particular, we develop an information-theoretic bound on the generalization error of the resulting target model that demonstrates an inherent bias-variance trade-off controlled by the subset choice.", "Guided by this analysis, we develop a method that partitions the target data into pseudo-labeled and unlabeled subsets to balance the trade-off.", "In addition to exploiting the pseudo-labeled subset, our algorithm further leverages the information in the unlabeled subset via a traditional unsupervised domain adaptation feature alignment procedure.", "Experiments on multiple benchmark datasets demonstrate the superior performance of the proposed method."]} {"id": "http://arxiv.org/abs/2202.01116", "title": "Unpaired Image Super-Resolution with Optimal Transport Maps.", "authors": "Milena Gazdieva, Litu Rout, Alexander Korotin, Alexander Filippov, Evgeny Burnaev", "abstract": "Real-world image super-resolution (SR) tasks often do not have paired datasets limiting the application of supervised techniques. As a result, the tasks are usually approached by unpaired techniques based on Generative Adversarial Networks (GANs) which yield complex training losses with several regularization terms such as content and identity losses. We theoretically investigate the optimization problems which arise in such models and find two surprising observations. First, the learned SR map is always an optimal transport (OT) map. Second, we empirically show that the learned map is biased, i.e., it may not actually transform the distribution of low-resolution images to high-resolution images. Inspired by these findings, we propose an algorithm for unpaired SR which learns an unbiased OT map for the perceptual transport cost. Unlike existing GAN-based alternatives, our algorithm has a simple optimization objective reducing the necessity to perform complex hyperparameter selection and use additional regularizations.
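The selectivity idea in the multi-source-free record above can be pictured with a simple confidence threshold; the paper instead derives the split from its information-theoretic bias-variance bound, so the rule below is only a stand-in.

```python
# Stand-in partition of target data into pseudo-labeled and unlabeled subsets.
import numpy as np

def partition_targets(source_probs, threshold=0.9):
    # source_probs: (n_source_models, n_points, n_classes) softmax outputs.
    avg = source_probs.mean(axis=0)        # ensemble the available source models
    conf = avg.max(axis=1)
    pseudo = avg.argmax(axis=1)
    keep = conf >= threshold               # confident points get pseudo-labels
    return pseudo[keep], keep              # the rest stays unlabeled (alignment)

probs = np.random.default_rng(0).dirichlet(np.ones(5), size=(3, 200))
labels, mask = partition_targets(probs)
print(mask.mean())                         # fraction of points pseudo-labeled
```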
At the same time, it provides nearly state-of-the-art performance on the large-scale unpaired AIM-19 dataset.", "sentences": ["Unpaired Image Super-Resolution with Optimal Transport Maps.", "Real-world image super-resolution (SR) tasks often do not have paired datasets limiting the application of supervised techniques.", "As a result, the tasks are usually approached by unpaired techniques based on Generative Adversarial Networks (GANs) which yield complex training losses with several regularization terms such as content and identity losses.", "We theoretically investigate the optimization problems which arise in such models and find two surprising observations.", "First, the learned SR map is always an optimal transport (OT) map.", "Second, we empirically show that the learned map is biased, i.e., it may not actually transform the distribution of low-resolution images to high-resolution images.", "Inspired by these findings, we propose an algorithm for unpaired SR which learns an unbiased OT map for the perceptual transport cost.", "Unlike existing GAN-based alternatives, our algorithm has a simple optimization objective reducing the necessity to perform complex hyperparameter selection and use additional regularizations.", "At the same time, it provides nearly state-of-the-art performance on the large-scale unpaired AIM-19 dataset."]} {"id": "http://arxiv.org/abs/2202.01210", "title": "Deep Layer-wise Networks Have Closed-Form Weights.", "authors": "Chieh Wu, Aria Masoomi, Arthur Gretton, Jennifer Dy", "abstract": "There is currently a debate within the neuroscience community over the likelihood of the brain performing backpropagation (BP). To better mimic the brain, training a network \\textit{one layer at a time} with only a \"single forward pass\" has been proposed as an alternative to bypass BP; we refer to these networks as \"layer-wise\" networks. We continue the work on layer-wise networks by answering two outstanding questions. First, $\\textit{do they have a closed-form solution?}$ Second, $\\textit{how do we know when to stop adding more layers?}$ This work proves that the Kernel Mean Embedding is the closed-form weight that achieves the network global optimum while driving these networks to converge towards a highly desirable kernel for classification; we call it the $\\textit{Neural Indicator Kernel}$.", "sentences": ["Deep Layer-wise Networks Have Closed-Form Weights.", "There is currently a debate within the neuroscience community over the likelihood of the brain performing backpropagation (BP).", "To better mimic the brain, training a network \\textit{one layer at a time} with only a \"single forward pass\" has been proposed as an alternative to bypass BP; we refer to these networks as \"layer-wise\" networks.", "We continue the work on layer-wise networks by answering two outstanding questions.", "First, $\\textit{do they have a closed-form solution?", "}$ Second, $\\textit{how do we know when to stop adding more layers?", "}$ This work proves that the Kernel Mean Embedding is the closed-form weight that achieves the network global optimum while driving these networks to converge towards a highly desirable kernel for classification; we call it the $\\textit{Neural Indicator Kernel}$."]} {"id": "http://arxiv.org/abs/2202.01243", "title": "Parameters or Privacy: A Provable Tradeoff Between Overparameterization and Membership Inference.", "authors": "Jasper Tan, Blake Mason, Hamid Javadi, Richard G.
Baraniuk", "abstract": "A surprising phenomenon in modern machine learning is the ability of a highly overparameterized model to generalize well (small error on the test data) even when it is trained to memorize the training data (zero error on the training data). This has led to an arms race towards increasingly overparameterized models (c.f., deep learning). In this paper, we study an underexplored hidden cost of overparameterization: the fact that overparameterized models are more vulnerable to privacy attacks, in particular the membership inference attack that predicts the (potentially sensitive) examples used to train a model. We significantly extend the relatively few empirical results on this problem by theoretically proving for an overparameterized linear regression model with Gaussian data that the membership inference vulnerability increases with the number of parameters. Moreover, a range of empirical studies indicates that more complex, nonlinear models exhibit the same behavior. Finally, we study different methods for mitigating such attacks in the overparameterized regime, such as noise addition and regularization, and conclude that simply reducing the parameters of an overparameterized model is an effective strategy to protect it from membership inference without greatly decreasing its generalization error.", "sentences": ["Parameters or Privacy: A Provable Tradeoff Between Overparameterization and Membership Inference.", "A surprising phenomenon in modern machine learning is the ability of a highly overparameterized model to generalize well (small error on the test data) even when it is trained to memorize the training data (zero error on the training data).", "This has led to an arms race towards increasingly overparameterized models (c.f., deep learning).", "In this paper, we study an underexplored hidden cost of overparameterization: the fact that overparameterized models are more vulnerable to privacy attacks, in particular the membership inference attack that predicts the (potentially sensitive) examples used to train a model.", "We significantly extend the relatively few empirical results on this problem by theoretically proving for an overparameterized linear regression model with Gaussian data that the membership inference vulnerability increases with the number of parameters.", "Moreover, a range of empirical studies indicates that more complex, nonlinear models exhibit the same behavior.", "Finally, we study different methods for mitigating such attacks in the overparameterized regime, such as noise addition and regularization, and conclude that simply reducing the parameters of an overparameterized model is an effective strategy to protect it from membership inference without greatly decreasing its generalization error."]} {"id": "http://arxiv.org/abs/2202.01263", "title": "NoisyMix: Boosting Robustness by Combining Data Augmentations, Stability Training, and Noise Injections.", "authors": "N. Benjamin Erichson, Soon Hoe Lim, Francisco Utrera, Winnie Xu, Ziang Cao, Michael W. Mahoney", "abstract": "For many real-world applications, obtaining stable and robust statistical performance is more important than simply achieving state-of-the-art predictive test accuracy, and thus robustness of neural networks is an increasingly important topic. Relatedly, data augmentation schemes have been shown to improve robustness with respect to input perturbations and domain shifts. 
Motivated by this, we introduce NoisyMix, a training scheme that combines data augmentations with stability training and noise injections to improve both model robustness and in-domain accuracy. This combination promotes models that are consistently more robust and that provide well-calibrated estimates of class membership probabilities. We demonstrate the benefits of NoisyMix on a range of benchmark datasets, including ImageNet-C, ImageNet-R, and ImageNet-P. Moreover, we provide theory to understand implicit regularization and robustness of NoisyMix.", "sentences": ["NoisyMix: Boosting Robustness by Combining Data Augmentations, Stability Training, and Noise Injections.", "For many real-world applications, obtaining stable and robust statistical performance is more important than simply achieving state-of-the-art predictive test accuracy, and thus robustness of neural networks is an increasingly important topic.", "Relatedly, data augmentation schemes have been shown to improve robustness with respect to input perturbations and domain shifts.", "Motivated by this, we introduce NoisyMix, a training scheme that combines data augmentations with stability training and noise injections to improve both model robustness and in-domain accuracy.", "This combination promotes models that are consistently more robust and that provide well-calibrated estimates of class membership probabilities.", "We demonstrate the benefits of NoisyMix on a range of benchmark datasets, including ImageNet-C, ImageNet-R, and ImageNet-P.", "Moreover, we provide theory to understand implicit regularization and robustness of NoisyMix."]} {"id": "http://arxiv.org/abs/2202.01267", "title": "FedSpace: An Efficient Federated Learning Framework at Satellites and Ground Stations.", "authors": "Jinhyun So, Kevin Hsieh, Behnaz Arzani, Shadi Noghabi, Salman Avestimehr, Ranveer Chandra", "abstract": "Large-scale deployments of low Earth orbit (LEO) satellites collect massive amounts of Earth imagery and sensor data, which can empower machine learning (ML) to address global challenges such as real-time disaster navigation and mitigation. However, it is often infeasible to download all the high-resolution images and train these ML models on the ground because of limited downlink bandwidth, sparse connectivity, and regularization constraints on the imagery resolution. To address these challenges, we leverage Federated Learning (FL), where ground stations and satellites collaboratively train a global ML model without sharing the captured images on the satellites. We show fundamental challenges in applying existing FL algorithms among satellites and ground stations, and we formulate an optimization problem which captures a unique trade-off between staleness and idleness. We propose a novel FL framework, named FedSpace, which dynamically schedules model aggregation based on the deterministic and time-varying connectivity according to satellite orbits.
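The three NoisyMix ingredients named above compose naturally. A schematic of the data path (mixup, then noise injection, then a stability penalty between clean and noisy predictions) might look as follows; the loss weighting and the exact noise model are assumptions here.

```python
# Schematic NoisyMix-style batch construction and stability penalty (sketch).
import numpy as np

def noisymix_batch(x1, x2, y1, y2, alpha=0.2, noise=0.1, seed=None):
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2          # mixup of inputs
    y = lam * y1 + (1 - lam) * y2          # mixup of one-hot targets
    x_noisy = x + noise * rng.normal(size=x.shape)   # noise injection
    return x, x_noisy, y                   # train on both views against y

def stability_penalty(p_clean, p_noisy, eps=1e-8):
    # Symmetric KL between predictions on the clean and noisy views.
    kl = lambda p, q: np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return np.mean(kl(p_clean, p_noisy) + kl(p_noisy, p_clean))
```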
Extensive numerical evaluations based on real-world satellite images and satellite networks show that FedSpace reduces the training time by 1.7 days (38.6%) over the state-of-the-art FL algorithms.", "sentences": ["FedSpace: An Efficient Federated Learning Framework at Satellites and Ground Stations.", "Large-scale deployments of low Earth orbit (LEO) satellites collect massive amounts of Earth imagery and sensor data, which can empower machine learning (ML) to address global challenges such as real-time disaster navigation and mitigation.", "However, it is often infeasible to download all the high-resolution images and train these ML models on the ground because of limited downlink bandwidth, sparse connectivity, and regularization constraints on the imagery resolution.", "To address these challenges, we leverage Federated Learning (FL), where ground stations and satellites collaboratively train a global ML model without sharing the captured images on the satellites.", "We show fundamental challenges in applying existing FL algorithms among satellites and ground stations, and we formulate an optimization problem which captures a unique trade-off between staleness and idleness.", "We propose a novel FL framework, named FedSpace, which dynamically schedules model aggregation based on the deterministic and time-varying connectivity according to satellite orbits.", "Extensive numerical evaluations based on real-world satellite images and satellite networks show that FedSpace reduces the training time by 1.7 days (38.6%) over the state-of-the-art FL algorithms."]} {"id": "http://arxiv.org/abs/2202.01269", "title": "Robust Estimation for Nonparametric Families via Generative Adversarial Networks.", "authors": "Banghua Zhu, Jiantao Jiao, Michael I. Jordan", "abstract": "We provide a general framework for designing Generative Adversarial Networks (GANs) to solve high dimensional robust statistics problems, which aim at estimating an unknown parameter of the true distribution given adversarially corrupted samples. Prior work focuses on the problem of robust mean and covariance estimation when the true distribution lies in the family of Gaussian distributions or elliptical distributions, and analyzes depth or scoring rule based GAN losses for the problem. Our work extends these to robust mean estimation, second moment estimation, and robust linear regression when the true distribution only has bounded Orlicz norms, which includes the broad family of sub-Gaussian, sub-Exponential and bounded moment distributions. We also provide a different set of sufficient conditions for the GAN loss to work: we only require its induced distance function to be a cumulative density function of some light-tailed distribution, which is easily satisfied by neural networks with sigmoid activation.
In terms of techniques, our proposed GAN losses can be viewed as a smoothed and generalized Kolmogorov-Smirnov distance, which overcomes the computational intractability of the original Kolmogorov-Smirnov distance used in the prior work.", "sentences": ["Robust Estimation for Nonparametric Families via Generative Adversarial Networks.", "We provide a general framework for designing Generative Adversarial Networks (GANs) to solve high dimensional robust statistics problems, which aim at estimating an unknown parameter of the true distribution given adversarially corrupted samples.", "Prior work focuses on the problem of robust mean and covariance estimation when the true distribution lies in the family of Gaussian distributions or elliptical distributions, and analyzes depth or scoring rule based GAN losses for the problem.", "Our work extends these to robust mean estimation, second moment estimation, and robust linear regression when the true distribution only has bounded Orlicz norms, which includes the broad family of sub-Gaussian, sub-Exponential and bounded moment distributions.", "We also provide a different set of sufficient conditions for the GAN loss to work: we only require its induced distance function to be a cumulative density function of some light-tailed distribution, which is easily satisfied by neural networks with sigmoid activation.", "In terms of techniques, our proposed GAN losses can be viewed as a smoothed and generalized Kolmogorov-Smirnov distance, which overcomes the computational intractability of the original Kolmogorov-Smirnov distance used in the prior work."]} {"id": "http://arxiv.org/abs/2202.01277", "title": "Global Optimization Networks.", "authors": "Sen Zhao, Erez Louidor, Olexander Mangylov, Maya Gupta", "abstract": "We consider the problem of estimating a good maximizer of a black-box function given noisy examples. To solve such problems, we propose to fit a new type of function which we call a global optimization network (GON), defined as any composition of an invertible function and a unimodal function, whose unique global maximizer can be inferred in $\\mathcal{O}(D)$ time. In this paper, we show how to construct invertible and unimodal functions by using linear inequality constraints on lattice models. We also extend to \\emph{conditional} GONs that find a global maximizer conditioned on specified inputs of other dimensions.
Experiments show the GON maximizers are statistically significantly better predictions than those produced by convex fits, GPR, or DNNs, and are more reasonable predictions for real-world problems.", "sentences": ["Global Optimization Networks.", "We consider the problem of estimating a good maximizer of a black-box function given noisy examples.", "To solve such problems, we propose to fit a new type of function which we call a global optimization network (GON), defined as any composition of an invertible function and a unimodal function, whose unique global maximizer can be inferred in $\\mathcal{O}(D)$ time.", "In this paper, we show how to construct invertible and unimodal functions by using linear inequality constraints on lattice models.", "We also extend to \\emph{conditional} GONs that find a global maximizer conditioned on specified inputs of other dimensions.", "Experiments show the GON maximizers are statistically significantly better predictions than those produced by convex fits, GPR, or DNNs, and are more reasonable predictions for real-world problems."]} {"id": "http://arxiv.org/abs/2202.01287", "title": "Fenrir: Physics-Enhanced Regression for Initial Value Problems.", "authors": "Filip Tronarp, Nathanael Bosch, Philipp Hennig", "abstract": "We show how probabilistic numerics can be used to convert an initial value problem into a Gauss--Markov process parametrised by the dynamics of the initial value problem. Consequently, the often difficult problem of parameter estimation in ordinary differential equations is reduced to hyperparameter estimation in Gauss--Markov regression, which tends to be considerably easier. The method's relation and benefits in comparison to classical numerical integration and gradient matching approaches is elucidated. In particular, the method can, in contrast to gradient matching, handle partial observations, and has certain routes for escaping local optima not available to classical numerical integration. Experimental results demonstrate that the method is on par or moderately better than competing approaches.", "sentences": ["Fenrir: Physics-Enhanced Regression for Initial Value Problems.", "We show how probabilistic numerics can be used to convert an initial value problem into a Gauss--Markov process parametrised by the dynamics of the initial value problem.", "Consequently, the often difficult problem of parameter estimation in ordinary differential equations is reduced to hyperparameter estimation in Gauss--Markov regression, which tends to be considerably easier.", "The method's relation and benefits in comparison to classical numerical integration and gradient matching approaches is elucidated.", "In particular, the method can, in contrast to gradient matching, handle partial observations, and has certain routes for escaping local optima not available to classical numerical integration.", "Experimental results demonstrate that the method is on par or moderately better than competing approaches."]} {"id": "http://arxiv.org/abs/2202.01314", "title": "Gradient estimators for normalising flows.", "authors": "Piotr Bialas, Piotr Korcyl, Tomasz Stebel", "abstract": "Recently a machine learning approach to Monte-Carlo simulations called Neural Markov Chain Monte-Carlo (NMCMC) is gaining traction. In its most popular form it uses the neural networks to construct normalizing flows which are then trained to approximate the desired target distribution. 
As this distribution is usually defined via a Hamiltonian or action, the standard learning algorithm requires estimation of the action gradient with respect to the fields. In this contribution we present another gradient estimator (and the corresponding PyTorch implementation) that avoids this calculation, thus potentially speeding up training for models with more complicated actions. We also study the statistical properties of several gradient estimators and show that our formulation leads to better training results.", "sentences": ["Gradient estimators for normalising flows.", "Recently a machine learning approach to Monte-Carlo simulations called Neural Markov Chain Monte-Carlo (NMCMC) is gaining traction.", "In its most popular form it uses the neural networks to construct normalizing flows which are then trained to approximate the desired target distribution.", "As this distribution is usually defined via a Hamiltonian or action, the standard learning algorithm requires estimation of the action gradient with respect to the fields.", "In this contribution we present another gradient estimator (and the corresponding PyTorch implementation) that avoids this calculation, thus potentially speeding up training for models with more complicated actions.", "We also study the statistical properties of several gradient estimators and show that our formulation leads to better training results."]} {"id": "http://arxiv.org/abs/2202.01361", "title": "Generative Flow Networks for Discrete Probabilistic Modeling.", "authors": "Dinghuai Zhang, Nikolay Malkin, Zhen Liu, Alexandra Volokhova, Aaron Courville, Yoshua Bengio", "abstract": "We present energy-based generative flow networks (EB-GFN), a novel probabilistic modeling algorithm for high-dimensional discrete data. Building upon the theory of generative flow networks (GFlowNets), we model the generation process by a stochastic data construction policy and thus amortize expensive MCMC exploration into a fixed number of actions sampled from a GFlowNet. We show how GFlowNets can approximately perform large-block Gibbs sampling to mix between modes. We propose a framework to jointly train a GFlowNet with an energy function, so that the GFlowNet learns to sample from the energy distribution, while the energy learns with an approximate MLE objective with negative samples from the GFlowNet.
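The gradient-estimators record above replaces the pathwise gradient, which needs the derivative of the action, with an estimator that needs only the action's values. The generic form of that trick is the score-function (REINFORCE) estimator of the reverse KL; below is a toy one-dimensional version with a Gaussian playing the role of the flow (the setup and the baseline are assumptions, not the paper's exact estimator):

```python
# Score-function estimator of d/dmu E_q[log q(x) + S(x)] -- no dS/dx needed.
import numpy as np

def kl_grad_mu(mu, sigma, S, n=100000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=n)
    logq = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    f = logq + S(x)                # integrand of the reverse KL (up to log Z)
    score = (x - mu) / sigma**2    # d log q / d mu
    b = f.mean()                   # baseline: reduces variance, keeps unbiasedness
    return np.mean((f - b) * score)

# Target density proportional to exp(-S) with S(x) = x^4 / 4.
print(kl_grad_mu(mu=1.0, sigma=0.5, S=lambda x: x**4 / 4))  # > 0: push mu toward 0
```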
We demonstrate EB-GFN's effectiveness on various probabilistic modeling tasks.", "sentences": ["Generative Flow Networks for Discrete Probabilistic Modeling.", "We present energy-based generative flow networks (EB-GFN), a novel probabilistic modeling algorithm for high-dimensional discrete data.", "Building upon the theory of generative flow networks (GFlowNets), we model the generation process by a stochastic data construction policy and thus amortize expensive MCMC exploration into a fixed number of actions sampled from a GFlowNet.", "We show how GFlowNets can approximately perform large-block Gibbs sampling to mix between modes.", "We propose a framework to jointly train a GFlowNet with an energy function, so that the GFlowNet learns to sample from the energy distribution, while the energy learns with an approximate MLE objective with negative samples from the GFlowNet.", "We demonstrate EB-GFN's effectiveness on various probabilistic modeling tasks."]} {"id": "http://arxiv.org/abs/2202.01454", "title": "Deep Hierarchy in Bandits.", "authors": "Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, Mohammad Ghavamzadeh", "abstract": "Mean rewards of actions are often correlated. The form of these correlations may be complex and unknown a priori, such as the preferences of a user for recommended products and their categories. To maximize statistical efficiency, it is important to leverage these correlations when learning. We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model with latent variables. Since the hierarchy can have multiple layers, we call it deep. We propose a hierarchical Thompson sampling algorithm (HierTS) for this problem, and show how to implement it efficiently for Gaussian hierarchies. The efficient implementation is possible due to a novel exact hierarchical representation of the posterior, which itself is of independent interest. We use this exact posterior to analyze the Bayes regret of HierTS in Gaussian bandits. Our analysis reflects the structure of the problem, that the regret decreases with the prior width, and also shows that hierarchies reduce the regret by non-constant factors in the number of actions. 
We confirm these theoretical findings empirically, in both synthetic and real-world experiments.", "sentences": ["Deep Hierarchy in Bandits.", "Mean rewards of actions are often correlated.", "The form of these correlations may be complex and unknown a priori, such as the preferences of a user for recommended products and their categories.", "To maximize statistical efficiency, it is important to leverage these correlations when learning.", "We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model with latent variables.", "Since the hierarchy can have multiple layers, we call it deep.", "We propose a hierarchical Thompson sampling algorithm (HierTS) for this problem, and show how to implement it efficiently for Gaussian hierarchies.", "The efficient implementation is possible due to a novel exact hierarchical representation of the posterior, which itself is of independent interest.", "We use this exact posterior to analyze the Bayes regret of HierTS in Gaussian bandits.", "Our analysis reflects the structure of the problem, that the regret decreases with the prior width, and also shows that hierarchies reduce the regret by non-constant factors in the number of actions.", "We confirm these theoretical findings empirically, in both synthetic and real-world experiments."]} {"id": "http://arxiv.org/abs/2202.01456", "title": "Fast and explainable clustering based on sorting.", "authors": "Xinye Chen, Stefan G\u00fcttel", "abstract": "We introduce a fast and explainable clustering method called CLASSIX. It consists of two phases, namely a greedy aggregation phase of the sorted data into groups of nearby data points, followed by the merging of groups into clusters. The algorithm is controlled by two scalar parameters, namely a distance parameter for the aggregation and another parameter controlling the minimal cluster size. Extensive experiments are conducted to give a comprehensive evaluation of the clustering performance on synthetic and real-world datasets, with various cluster shapes and low to high feature dimensionality. Our experiments demonstrate that CLASSIX competes with state-of-the-art clustering algorithms. The algorithm has linear space complexity and achieves near linear time complexity on a wide range of problems. 
Its inherent simplicity allows for the generation of intuitive explanations of the computed clusters.", "sentences": ["Fast and explainable clustering based on sorting.", "We introduce a fast and explainable clustering method called CLASSIX.", "It consists of two phases, namely a greedy aggregation phase of the sorted data into groups of nearby data points, followed by the merging of groups into clusters.", "The algorithm is controlled by two scalar parameters, namely a distance parameter for the aggregation and another parameter controlling the minimal cluster size.", "Extensive experiments are conducted to give a comprehensive evaluation of the clustering performance on synthetic and real-world datasets, with various cluster shapes and low to high feature dimensionality.", "Our experiments demonstrate that CLASSIX competes with state-of-the-art clustering algorithms.", "The algorithm has linear space complexity and achieves near linear time complexity on a wide range of problems.", "Its inherent simplicity allows for the generation of intuitive explanations of the computed clusters."]} {"id": "http://arxiv.org/abs/2202.01463", "title": "Minimax rate of consistency for linear models with missing values.", "authors": "Alexis Ayme (LPSM (UMR\\_8001)), Claire Boyer (LPSM (UMR\\_8001), MOKAPLAN), Aymeric Dieuleveut (CMAP), Erwan Scornet (CMAP)", "abstract": "Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...). In fact, the very nature of missing values usually prevents us from running standard learning algorithms. In this paper, we focus on the extensively-studied linear models, but in the presence of missing values, which turns out to be quite a challenging task. Indeed, the Bayes rule can be decomposed as a sum of predictors corresponding to each missing pattern. This eventually requires to solve a number of learning tasks, exponential in the number of input features, which makes predictions impossible for current real-world datasets. First, we propose a rigorous setting to analyze a least-square type estimator and establish a bound on the excess risk which increases exponentially in the dimension. Consequently, we leverage the missing data distribution to propose a new algorithm, and derive associated adaptive risk bounds that turn out to be minimax optimal.
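The aggregation phase of CLASSIX described above can be sketched directly: sort the data (here along the first principal component) and greedily open a new group at each unassigned point, absorbing unassigned points within the distance parameter of the group's starting point. CLASSIX's normalization, early termination via the sorted order, and the merging phase are omitted.

```python
# Greedy aggregation sketch in the spirit of CLASSIX (merging phase omitted).
import numpy as np

def aggregate(X, radius=0.5):
    Xc = X - X.mean(axis=0)
    pc1 = np.linalg.svd(Xc, full_matrices=False)[2][0]   # first principal axis
    order = np.argsort(Xc @ pc1)                         # the "sorting" in the title
    labels = np.full(len(X), -1)
    group = -1
    for i in order:
        if labels[i] >= 0:
            continue                       # already absorbed by an earlier group
        group += 1
        labels[i] = group                  # i is this group's starting point
        free = labels < 0
        close = np.linalg.norm(X[free] - X[i], axis=1) <= radius
        labels[np.flatnonzero(free)[close]] = group
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(50, 2)) for c in (0.0, 3.0)])
print(np.unique(aggregate(X)))   # a few groups; merging would fuse overlapping ones
```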
Numerical experiments highlight the benefits of our method compared to state-of-the-art algorithms used for predictions with missing values.", "sentences": ["Minimax rate of consistency for linear models with missing values.", "Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...).", "In fact, the very nature of missing values usually prevents us from running standard learning algorithms.", "In this paper, we focus on the extensively-studied linear models, but in the presence of missing values, which turns out to be quite a challenging task.", "Indeed, the Bayes rule can be decomposed as a sum of predictors corresponding to each missing pattern.", "This eventually requires to solve a number of learning tasks, exponential in the number of input features, which makes predictions impossible for current real-world datasets.", "First, we propose a rigorous setting to analyze a least-square type estimator and establish a bound on the excess risk which increases exponentially in the dimension.", "Consequently, we leverage the missing data distribution to propose a new algorithm, and derive associated adaptive risk bounds that turn out to be minimax optimal.", "Numerical experiments highlight the benefits of our method compared to state-of-the-art algorithms used for predictions with missing values."]} {"id": "http://arxiv.org/abs/2202.01545", "title": "Byzantine-Robust Decentralized Learning via Self-Centered Clipping.", "authors": "Lie He, Sai Praneeth Karimireddy, Martin Jaggi", "abstract": "In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs. Unlike federated learning where workers communicate through a server, workers in the decentralized environment can only talk to their neighbors, making it harder to reach consensus. We identify a novel dissensus attack in which few malicious nodes can take advantage of information bottlenecks in the topology to poison the collaboration. To address these issues, we propose a Self-Centered Clipping (SCClip) algorithm for Byzantine-robust consensus and optimization, which is the first to provably converge to a $O(\\delta_{\\max}\\zeta^2/\\gamma^2)$ neighborhood of the stationary point for non-convex objectives under standard assumptions.
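The decomposition in the missing-values record above is concrete: the Bayes predictor splits into one regression per missingness pattern, which is exactly why the naive estimator needs up to 2^d fits. A small sketch of that pattern-by-pattern least squares (the paper's adaptive algorithm is not reproduced here):

```python
# One least-squares fit per observed-feature mask (the exponential baseline).
import numpy as np

def fit_by_pattern(X, y):
    models = {}
    masks = ~np.isnan(X)
    for key in {tuple(m) for m in masks}:
        rows = (masks == key).all(axis=1)
        cols = np.array(key)
        A = np.c_[np.ones(rows.sum()), X[np.ix_(rows, cols)]]
        models[key], *_ = np.linalg.lstsq(A, y[rows], rcond=None)
    return models

def predict_one(models, x):
    key = tuple(~np.isnan(x))
    w = models[key]                        # fails for patterns unseen in training
    return w[0] + x[np.array(key)] @ w[1:]

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=300)
X[rng.random(X.shape) < 0.2] = np.nan      # MCAR-style missingness
models = fit_by_pattern(X, y)
print(len(models))                         # up to 2^3 = 8 patterns, each its own fit
```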
Finally, we demonstrate the encouraging empirical performance of SCClip under a large number of attacks.", "sentences": ["Byzantine-Robust Decentralized Learning via Self-Centered Clipping.", "In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs.", "Unlike federated learning where workers communicate through a server, workers in the decentralized environment can only talk to their neighbors, making it harder to reach consensus.", "We identify a novel dissensus attack in which few malicious nodes can take advantage of information bottlenecks in the topology to poison the collaboration.", "To address these issues, we propose a Self-Centered Clipping (SCClip) algorithm for Byzantine-robust consensus and optimization, which is the first to provably converge to a $O(\\delta_{\\max}\\zeta^2/\\gamma^2)$ neighborhood of the stationary point for non-convex objectives under standard assumptions.", "Finally, we demonstrate the encouraging empirical performance of SCClip under a large number of attacks."]} {"id": "http://arxiv.org/abs/2202.01562", "title": "Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model.", "authors": "Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto", "abstract": "In real-world recommender systems and search engines, optimizing ranking decisions to present a ranked list of relevant items is critical. Off-policy evaluation (OPE) for ranking policies is thus gaining a growing interest because it enables performance estimation of new ranking policies using only logged data. Although OPE in contextual bandits has been studied extensively, its naive application to the ranking setting faces a critical variance issue due to the huge item space. To tackle this problem, previous studies introduce some assumptions on user behavior to make the combinatorial item space tractable. However, an unrealistic assumption may, in turn, cause serious bias. Therefore, appropriately controlling the bias-variance tradeoff by imposing a reasonable assumption is the key for success in OPE of ranking policies. To achieve a well-balanced bias-variance tradeoff, we propose the Cascade Doubly Robust estimator building on the cascade assumption, which assumes that a user interacts with items sequentially from the top position in a ranking. We show that the proposed estimator is unbiased in more cases compared to existing estimators that make stronger assumptions. Furthermore, compared to a previous estimator based on the same cascade assumption, the proposed estimator reduces the variance by leveraging a control variate. 
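One self-centered clipping step from the SCClip record above, written out: each worker clips its neighbors' model differences to a radius tau around its own iterate before mixing, so a poisoned model can only pull an honest worker a bounded distance per round (the mixing weights and tau below are assumptions for illustration).

```python
# One round of self-centered clipping over worker models (numpy sketch).
import numpy as np

def scclip_step(params, W, tau):
    # params: (n_workers, dim) current models; W: (n, n) mixing weights.
    new = np.empty_like(params)
    for i in range(len(params)):
        diffs = params - params[i]                         # x_j - x_i for all j
        norms = np.maximum(np.linalg.norm(diffs, axis=1), 1e-12)
        clipped = diffs * np.minimum(1.0, tau / norms)[:, None]
        new[i] = params[i] + W[i] @ clipped                # clip, then mix
    return new

params = np.random.default_rng(0).normal(size=(5, 4))
params[0] += 100.0                       # a Byzantine worker's poisoned model
W = np.full((5, 5), 0.2)                 # uniform mixing, for illustration
moved = np.linalg.norm(scclip_step(params, W, tau=1.0)[1] - params[1])
print(moved)                             # bounded: at most sum_j W[1,j] * tau = 1.0
```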
Comprehensive experiments on both synthetic and real-world data demonstrate that our estimator leads to more accurate OPE than existing estimators in a variety of settings.", "sentences": ["Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model.", "In real-world recommender systems and search engines, optimizing ranking decisions to present a ranked list of relevant items is critical.", "Off-policy evaluation (OPE) for ranking policies is thus gaining a growing interest because it enables performance estimation of new ranking policies using only logged data.", "Although OPE in contextual bandits has been studied extensively, its naive application to the ranking setting faces a critical variance issue due to the huge item space.", "To tackle this problem, previous studies introduce some assumptions on user behavior to make the combinatorial item space tractable.", "However, an unrealistic assumption may, in turn, cause serious bias.", "Therefore, appropriately controlling the bias-variance tradeoff by imposing a reasonable assumption is the key for success in OPE of ranking policies.", "To achieve a well-balanced bias-variance tradeoff, we propose the Cascade Doubly Robust estimator building on the cascade assumption, which assumes that a user interacts with items sequentially from the top position in a ranking.", "We show that the proposed estimator is unbiased in more cases compared to existing estimators that make stronger assumptions.", "Furthermore, compared to a previous estimator based on the same cascade assumption, the proposed estimator reduces the variance by leveraging a control variate.", "Comprehensive experiments on both synthetic and real-world data demonstrate that our estimator leads to more accurate OPE than existing estimators in a variety of settings."]} {"id": "http://arxiv.org/abs/2202.01566", "title": "Unified theory of atom-centered representations and graph convolutional machine-learning schemes.", "authors": "Jigyasa Nigam, Guillaume Fraux, Michele Ceriotti", "abstract": "Data-driven schemes that associate molecular and crystal structures with their microscopic properties share the need for a concise, effective description of the arrangement of their atomic constituents. Many types of models rely on descriptions of atom-centered environments, that are associated with an atomic property or with an atomic contribution to an extensive macroscopic quantity. Frameworks in this class can be understood in terms of atom-centered density correlations (ACDC), that are used as a basis for a body-ordered, symmetry-adapted expansion of the targets. Several other schemes, that gather information on the relationship between neighboring atoms using graph-convolutional (or message-passing) ideas, cannot be directly mapped to correlations centered around a single atom. 
We generalize the ACDC framework to include multi-centered information, generating representations that provide a complete linear basis to regress symmetric functions of atomic coordinates, and form the basis to systematize our understanding of both atom-centered and graph-convolutional machine-learning schemes.", "sentences": ["Unified theory of atom-centered representations and graph convolutional machine-learning schemes.", "Data-driven schemes that associate molecular and crystal structures with their microscopic properties share the need for a concise, effective description of the arrangement of their atomic constituents.", "Many types of models rely on descriptions of atom-centered environments, that are associated with an atomic property or with an atomic contribution to an extensive macroscopic quantity.", "Frameworks in this class can be understood in terms of atom-centered density correlations (ACDC), that are used as a basis for a body-ordered, symmetry-adapted expansion of the targets.", "Several other schemes, that gather information on the relationship between neighboring atoms using graph-convolutional (or message-passing) ideas, cannot be directly mapped to correlations centered around a single atom.", "We generalize the ACDC framework to include multi-centered information, generating representations that provide a complete linear basis to regress symmetric functions of atomic coordinates, and form the basis to systematize our understanding of both atom-centered and graph-convolutional machine-learning schemes."]} {"id": "http://arxiv.org/abs/2202.01614", "title": "The RoyalFlush System of Speech Recognition for M2MeT Challenge.", "authors": "Shuaishuai Ye, Peiyao Wang, Shunfei Chen, Xinhui Hu, Xinkang Xu", "abstract": "This paper describes our RoyalFlush system for the track of multi-speaker automatic speech recognition (ASR) in the M2MeT challenge. We adopted the serialized output training (SOT) based multi-speaker ASR system with large-scale simulation data. Firstly, we investigated a set of front-end methods, including multi-channel weighted predicted error (WPE), beamforming, speech separation, speech enhancement and so on, to process training, validation and test sets. But we only selected WPE and beamforming as our frontend methods according to their experimental results. Secondly, we made great efforts in the data augmentation for multi-speaker ASR, mainly including adding noise and reverberation, overlapped speech simulation, multi-channel speech simulation, speed perturbation, front-end processing, and so on, which brought us a great performance improvement. Finally, in order to make full use of the performance complementarity of different model architectures, we trained the standard conformer based joint CTC/Attention (Conformer) and U2++ ASR model with a bidirectional attention decoder, a modification of Conformer, to fuse their results.
Compared with the official baseline system, our system achieved a 12.22% absolute Character Error Rate (CER) reduction on the validation set and 12.11% on the test set.", "sentences": ["The RoyalFlush System of Speech Recognition for M2MeT Challenge.", "This paper describes our RoyalFlush system for the track of multi-speaker automatic speech recognition (ASR) in the M2MeT challenge.", "We adopted the serialized output training (SOT) based multi-speaker ASR system with large-scale simulation data.", "Firstly, we investigated a set of front-end methods, including multi-channel weighted predicted error (WPE), beamforming, speech separation, speech enhancement and so on, to process training, validation and test sets.", "But we only selected WPE and beamforming as our frontend methods according to their experimental results.", "Secondly, we made great efforts in the data augmentation for multi-speaker ASR, mainly including adding noise and reverberation, overlapped speech simulation, multi-channel speech simulation, speed perturbation, front-end processing, and so on, which brought us a great performance improvement.", "Finally, in order to make full use of the performance complementarity of different model architectures, we trained the standard conformer based joint CTC/Attention (Conformer) and U2++ ASR model with a bidirectional attention decoder, a modification of Conformer, to fuse their results.", "Compared with the official baseline system, our system achieved a 12.22% absolute Character Error Rate (CER) reduction on the validation set and 12.11% on the test set."]} {"id": "http://arxiv.org/abs/2202.01619", "title": "On Manifold Hypothesis: Hypersurface Submanifold Embedding Using Osculating Hyperspheres.", "authors": "Benyamin Ghojogh, Fakhri Karray, Mark Crowley", "abstract": "Consider a set of $n$ data points in the Euclidean space $\\mathbb{R}^d$. This set is called a dataset in machine learning and data science. The manifold hypothesis states that the dataset lies on a low-dimensional submanifold with high probability. All dimensionality reduction and manifold learning methods rely on the manifold hypothesis. In this paper, we show that the dataset lies on an embedded hypersurface submanifold which is locally $(d-1)$-dimensional. Hence, we show that the manifold hypothesis holds at least for the embedding dimensionality $d-1$. Using an induction in a pyramid structure, we also extend the embedding dimensionality to lower embedding dimensionalities to show the validity of the manifold hypothesis for embedding dimensionalities $\\{1, 2, \\dots, d-1\\}$. For embedding the hypersurface, we first construct the $d$ nearest neighbors graph for data. For every point, we fit an osculating hypersphere $S^{d-1}$ using its neighbors where this hypersphere is osculating to a hypothetical hypersurface. Then, using surgery theory, we apply surgery on the osculating hyperspheres to obtain $n$ hyper-caps. We connect the hyper-caps to one another using partial hyper-cylinders. By connecting all parts, the embedded hypersurface is obtained as the disjoint union of these elements. We discuss the geometrical characteristics of the embedded hypersurface, such as having boundary, its topology, smoothness, boundedness, orientability, compactness, and injectivity. Some discussion is also provided for the linearity and structure of data.
This paper lies at the intersection of several fields of science, including machine learning, differential geometry, and algebraic topology.", "sentences": ["On Manifold Hypothesis: Hypersurface Submanifold Embedding Using Osculating Hyperspheres.", "Consider a set of $n$ data points in the Euclidean space $\mathbb{R}^d$.", "This set is called a dataset in machine learning and data science.", "The manifold hypothesis states that the dataset lies on a low-dimensional submanifold with high probability.", "All dimensionality reduction and manifold learning methods rely on the manifold hypothesis.", "In this paper, we show that the dataset lies on an embedded hypersurface submanifold which is locally $(d-1)$-dimensional.", "Hence, we show that the manifold hypothesis holds at least for the embedding dimensionality $d-1$.", "Using an induction in a pyramid structure, we also extend the embedding dimensionality to lower embedding dimensionalities to show the validity of the manifold hypothesis for embedding dimensionalities $\{1, 2, \dots, d-1\}$.", "For embedding the hypersurface, we first construct the $d$ nearest neighbors graph for the data.", "For every point, we fit an osculating hypersphere $S^{d-1}$ using its neighbors, where this hypersphere osculates a hypothetical hypersurface.", "Then, using surgery theory, we apply surgery on the osculating hyperspheres to obtain $n$ hyper-caps.", "We connect the hyper-caps to one another using partial hyper-cylinders.", "By connecting all parts, the embedded hypersurface is obtained as the disjoint union of these elements.", "We discuss the geometrical characteristics of the embedded hypersurface, such as having boundary, its topology, smoothness, boundedness, orientability, compactness, and injectivity.", "Some discussion is also provided on the linearity and structure of the data.", "This paper lies at the intersection of several fields of science, including machine learning, differential geometry, and algebraic topology."]} {"id": "http://arxiv.org/abs/2202.01625", "title": "Efficient learning of hidden state LTI state space models of unknown order.", "authors": "Boualem Djehiche, Othmane Mazhar", "abstract": "The aim of this paper is to address two related estimation problems arising in the setup of hidden state linear time invariant (LTI) state space systems when the dimension of the hidden state is unknown. Namely, the estimation of any finite number of the system's Markov parameters and the estimation of a minimal realization for the system, both from the partial observation of a single trajectory. For both problems, we provide statistical guarantees in the form of various estimation error upper bounds, rank recovery conditions, and sample complexity estimates. Specifically, we first show that the low-rank solution of the Hankel penalized least squares estimator satisfies an estimation error in $S_p$-norms for $p \in [1,2]$ that captures the effect of the system order better than the existing operator norm upper bound for the simple least squares estimator. We then provide a stability analysis for an estimation procedure based on a variant of the Ho-Kalman algorithm that improves the dependence on both the dimension and the least singular value of the Hankel matrix of the Markov parameters.
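The Ho-Kalman step referenced just above factors a Hankel matrix of Markov parameters to recover a realization $(A, B, C)$. A bare-bones single-input single-output sketch of the classical algorithm, assuming exact Markov parameters; the paper's robust variant and its penalized pre-estimation are not reproduced:

```python
import numpy as np

def ho_kalman(markov, order, rows=None):
    """Classical Ho-Kalman realization from scalar Markov parameters
    h_k = C A^k B (a minimal SISO sketch, assuming exact parameters)."""
    markov = np.asarray(markov, dtype=float)
    rows = rows or (len(markov) - 1) // 2
    cols = len(markov) - rows
    H = np.array([[markov[i + j] for j in range(cols - 1)] for i in range(rows)])
    Hshift = np.array([[markov[i + j + 1] for j in range(cols - 1)] for i in range(rows)])
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    sq = np.sqrt(s[:order])
    Obs = U[:, :order] * sq                     # observability factor
    Ctr = (Vt[:order].T * sq).T                 # controllability factor
    A = np.linalg.pinv(Obs) @ Hshift @ np.linalg.pinv(Ctr)
    return A, Ctr[:, :1], Obs[:1, :]            # (A, B, C), up to similarity

# Sanity check: recover a known 2-state system from its impulse response.
A0 = np.array([[0.9, 0.1], [0.0, 0.5]]); B0 = np.array([[1.0], [1.0]]); C0 = np.array([[1.0, 0.0]])
h = [(C0 @ np.linalg.matrix_power(A0, k) @ B0).item() for k in range(12)]
A, B, C = ho_kalman(h, order=2)
print([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(5)])  # matches h[:5]
```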
Finally, we propose an estimation algorithm for the minimal realization that uses both the Hankel penalized least squares estimator and the Ho-Kalman based estimation procedure; with high probability, it recovers the correct order of the system and satisfies a new fast rate in the $S_2$-norm, with a polynomial reduction in the dependence on the dimension and other parameters of the problem.", "sentences": ["Efficient learning of hidden state LTI state space models of unknown order.", "The aim of this paper is to address two related estimation problems arising in the setup of hidden state linear time invariant (LTI) state space systems when the dimension of the hidden state is unknown.", "Namely, the estimation of any finite number of the system's Markov parameters and the estimation of a minimal realization for the system, both from the partial observation of a single trajectory.", "For both problems, we provide statistical guarantees in the form of various estimation error upper bounds, rank recovery conditions, and sample complexity estimates.", "Specifically, we first show that the low-rank solution of the Hankel penalized least squares estimator satisfies an estimation error in $S_p$-norms for $p \in [1,2]$ that captures the effect of the system order better than the existing operator norm upper bound for the simple least squares estimator.", "We then provide a stability analysis for an estimation procedure based on a variant of the Ho-Kalman algorithm that improves the dependence on both the dimension and the least singular value of the Hankel matrix of the Markov parameters.", "Finally, we propose an estimation algorithm for the minimal realization that uses both the Hankel penalized least squares estimator and the Ho-Kalman based estimation procedure; with high probability, it recovers the correct order of the system and satisfies a new fast rate in the $S_2$-norm, with a polynomial reduction in the dependence on the dimension and other parameters of the problem."]} {"id": "http://arxiv.org/abs/2202.01627", "title": "Non-Vacuous Generalisation Bounds for Shallow Neural Networks.", "authors": "Felix Biggs, Benjamin Guedj", "abstract": "We focus on a specific class of shallow neural networks with a single hidden layer, namely those with $L_2$-normalised data and either a sigmoid-shaped Gaussian error function (\"erf\") activation or a Gaussian Error Linear Unit (GELU) activation. For these networks, we derive new generalisation bounds through the PAC-Bayesian theory; unlike most existing such bounds they apply to neural networks with deterministic rather than randomised parameters.
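For concreteness, the function class just described is a one-hidden-layer network acting on $L_2$-normalised inputs with an erf (or GELU) activation. A minimal sketch, with shapes and any scaling conventions chosen for illustration rather than taken from the paper:

```python
import numpy as np
from scipy.special import erf

def shallow_erf_net(x, W, v):
    """f(x) = v . erf(W x) on an L2-normalised input: a one-hidden-layer
    network of the kind the bounds above cover (shapes are illustrative;
    the paper's exact parameterisation may include scalings not shown)."""
    x = x / np.linalg.norm(x)
    return v @ erf(W @ x)

rng = np.random.default_rng(0)
x = rng.standard_normal(784)                    # e.g. a flattened MNIST image
W = rng.standard_normal((100, 784)); v = rng.standard_normal(100)
print(shallow_erf_net(x, W, v))
```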
Our bounds are empirically non-vacuous when the network is trained with vanilla stochastic gradient descent on MNIST and Fashion-MNIST.", "sentences": ["Non-Vacuous Generalisation Bounds for Shallow Neural Networks.", "We focus on a specific class of shallow neural networks with a single hidden layer, namely those with $L_2$-normalised data and either a sigmoid-shaped Gaussian error function (\"erf\") activation or a Gaussian Error Linear Unit (GELU) activation.", "For these networks, we derive new generalisation bounds through the PAC-Bayesian theory; unlike most existing such bounds they apply to neural networks with deterministic rather than randomised parameters.", "Our bounds are empirically non-vacuous when the network is trained with vanilla stochastic gradient descent on MNIST and Fashion-MNIST."]} {"id": "http://arxiv.org/abs/2202.01661", "title": "Selection in the Presence of Implicit Bias: The Advantage of Intersectional Constraints.", "authors": "Anay Mehrotra, Bary S. R. Pradelski, Nisheeth K. Vishnoi", "abstract": "In selection processes such as hiring, promotion, and college admissions, implicit bias toward socially-salient attributes such as race, gender, or sexual orientation of candidates is known to produce persistent inequality and reduce aggregate utility for the decision maker. Interventions such as the Rooney Rule and its generalizations, which require the decision maker to select at least a specified number of individuals from each affected group, have been proposed to mitigate the adverse effects of implicit bias in selection. Recent works have established that such lower-bound constraints can be very effective in improving aggregate utility in the case when each individual belongs to at most one affected group. However, in several settings, individuals may belong to multiple affected groups and, consequently, face more extreme implicit bias due to this intersectionality. We consider independently drawn utilities and show that, in the intersectional case, the aforementioned non-intersectional constraints can only recover part of the total utility achievable in the absence of implicit bias. On the other hand, we show that if one includes appropriate lower-bound constraints on the intersections, almost all the utility achievable in the absence of implicit bias can be recovered. 
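The effect described above can be seen in a toy simulation: draw true utilities, let implicit bias multiplicatively discount the utilities the decision maker observes for the affected group, and compare selection with and without a lower-bound constraint. The sketch below covers only a single affected attribute for brevity (the paper's point concerns constraints on intersections); the bias factor and sizes are illustrative assumptions:

```python
import numpy as np

def constrained_selection(true_u, affected, k, lower_bound):
    """Greedy top-k selection on *biased observed* utilities subject to a
    lower-bound constraint for one affected group (a Rooney-Rule-style toy;
    the 0.5 bias factor and the single-attribute setting are assumptions)."""
    observed = true_u * np.where(affected, 0.5, 1.0)    # implicit bias discounts scores
    picked, n_affected = [], 0
    for i in np.argsort(-observed):
        reserve = max(0, lower_bound - n_affected)
        if not affected[i] and k - len(picked) <= reserve:
            continue                                    # hold remaining slots for the group
        picked.append(i); n_affected += bool(affected[i])
        if len(picked) == k:
            break
    return true_u[np.array(picked)].sum()               # utility actually delivered

rng = np.random.default_rng(1)
u = rng.uniform(size=1000); g = rng.uniform(size=1000) < 0.3
print(constrained_selection(u, g, k=50, lower_bound=0))    # biased, unconstrained
print(constrained_selection(u, g, k=50, lower_bound=15))   # typically recovers more true utility
```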
Thus, intersectional constraints can offer a significant advantage over a reductionist dimension-by-dimension non-intersectional approach to reducing inequality.", "sentences": ["Selection in the Presence of Implicit Bias: The Advantage of Intersectional Constraints.", "In selection processes such as hiring, promotion, and college admissions, implicit bias toward socially-salient attributes such as race, gender, or sexual orientation of candidates is known to produce persistent inequality and reduce aggregate utility for the decision maker.", "Interventions such as the Rooney Rule and its generalizations, which require the decision maker to select at least a specified number of individuals from each affected group, have been proposed to mitigate the adverse effects of implicit bias in selection.", "Recent works have established that such lower-bound constraints can be very effective in improving aggregate utility in the case when each individual belongs to at most one affected group.", "However, in several settings, individuals may belong to multiple affected groups and, consequently, face more extreme implicit bias due to this intersectionality.", "We consider independently drawn utilities and show that, in the intersectional case, the aforementioned non-intersectional constraints can only recover part of the total utility achievable in the absence of implicit bias.", "On the other hand, we show that if one includes appropriate lower-bound constraints on the intersections, almost all the utility achievable in the absence of implicit bias can be recovered.", "Thus, intersectional constraints can offer a significant advantage over a reductionist dimension-by-dimension non-intersectional approach to reducing inequality."]} {"id": "http://arxiv.org/abs/2202.01666", "title": "Equality Is Not Equity: Proportional Fairness in Federated Learning.", "authors": "Guojun Zhang, Saber Malekmohammadi, Xi Chen, Yaoliang Yu", "abstract": "Ensuring fairness of machine learning (ML) algorithms is becoming an increasingly important mission for ML service providers. This is even more critical and challenging in the federated learning (FL) scenario, given a large number of diverse participating clients. Simply mandating equality across clients could lead to many undesirable consequences, potentially discouraging high-performing clients and resulting in sub-optimal overall performance. In order to achieve better equity rather than equality, in this work, we introduce and study proportional fairness (PF) in FL, which has a deep connection with game theory. By viewing FL from a cooperative game perspective, where the players (clients) collaboratively learn a good model, we formulate PF as Nash bargaining solutions. Based on this concept, we propose PropFair, a novel and easy-to-implement algorithm for effectively finding PF solutions, and we prove its convergence properties. 
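The Nash-bargaining view of proportional fairness has a simple first-order consequence: maximizing $\sum_i \log u_i$ weights each client by the inverse of its utility. A toy sketch of that gradient weighting, where the baseline constant and the utility definition u_i = baseline - loss_i are our assumptions; the full PropFair algorithm and its convergence analysis are in the paper:

```python
import numpy as np

def propfair_style_weights(losses, baseline=1.0):
    """Gradient weights induced by the Nash-bargaining objective
    F = sum_i log(baseline - loss_i): each client is weighted by the
    inverse of its utility u_i = baseline - loss_i (a sketch of the
    objective only, not the authors' full algorithm)."""
    u = baseline - np.asarray(losses, dtype=float)
    assert (u > 0).all(), "utilities must stay above the disagreement point"
    w = 1.0 / u
    return w / w.sum()

print(propfair_style_weights([0.2, 0.5, 0.9]))  # the worst-off client gets the largest weight
```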
We illustrate through experiments that PropFair simultaneously improves both the worst-case and the overall performance over state-of-the-art fair FL algorithms on a wide array of vision and language datasets, thus achieving better equity.", "sentences": ["Equality Is Not Equity: Proportional Fairness in Federated Learning.", "Ensuring fairness of machine learning (ML) algorithms is becoming an increasingly important mission for ML service providers.", "This is even more critical and challenging in the federated learning (FL) scenario, given a large number of diverse participating clients.", "Simply mandating equality across clients could lead to many undesirable consequences, potentially discouraging high-performing clients and resulting in sub-optimal overall performance.", "In order to achieve better equity rather than equality, in this work, we introduce and study proportional fairness (PF) in FL, which has a deep connection with game theory.", "By viewing FL from a cooperative game perspective, where the players (clients) collaboratively learn a good model, we formulate PF as Nash bargaining solutions.", "Based on this concept, we propose PropFair, a novel and easy-to-implement algorithm for effectively finding PF solutions, and we prove its convergence properties.", "We illustrate through experiments that PropFair simultaneously improves both the worst-case and the overall performance over state-of-the-art fair FL algorithms on a wide array of vision and language datasets, thus achieving better equity."]} {"id": "http://arxiv.org/abs/2202.01671", "title": "Log-Euclidean Signatures for Intrinsic Distances Between Unaligned Datasets.", "authors": "Tal Shnitzer, Mikhail Yurochkin, Kristjan Greenewald, Justin Solomon", "abstract": "The need for efficiently comparing and representing datasets with unknown alignment spans various fields, from model analysis and comparison in machine learning to trend discovery in collections of medical datasets. We use manifold learning to compare the intrinsic geometric structures of different datasets by comparing their diffusion operators, symmetric positive-definite (SPD) matrices that relate to approximations of the continuous Laplace-Beltrami operator from discrete samples. Existing methods typically compare such operators in a pointwise manner or assume known data alignment. Instead, we exploit the Riemannian geometry of SPD matrices to compare these operators and define a new theoretically-motivated distance based on a lower bound of the log-Euclidean metric. Our framework facilitates comparison of data manifolds expressed in datasets with different sizes, numbers of features, and measurement modalities.
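The log-Euclidean metric underlying the distance just described reduces, in its basic form, to a Frobenius distance between matrix logarithms of SPD matrices. A minimal sketch of that base metric (the paper's LES distance is built on a lower bound of it, which is not reproduced here):

```python
import numpy as np

def spd_log(M):
    """Matrix logarithm of a symmetric positive-definite matrix via eigh."""
    w, V = np.linalg.eigh(M)
    return (V * np.log(w)) @ V.T

def log_euclidean_distance(A, B):
    """Log-Euclidean metric on SPD matrices: d(A, B) = ||log(A) - log(B)||_F."""
    return np.linalg.norm(spd_log(A) - spd_log(B), ord="fro")

# Toy usage on two random SPD matrices.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5)); A = X @ X.T + 5 * np.eye(5)
Y = rng.standard_normal((5, 5)); B = Y @ Y.T + 5 * np.eye(5)
print(log_euclidean_distance(A, B))
```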
Our log-Euclidean signature (LES) distance recovers meaningful structural differences, outperforming competing methods in various application domains.", "sentences": ["Log-Euclidean Signatures for Intrinsic Distances Between Unaligned Datasets.", "The need for efficiently comparing and representing datasets with unknown alignment spans various fields, from model analysis and comparison in machine learning to trend discovery in collections of medical datasets.", "We use manifold learning to compare the intrinsic geometric structures of different datasets by comparing their diffusion operators, symmetric positive-definite (SPD) matrices that relate to approximations of the continuous Laplace-Beltrami operator from discrete samples.", "Existing methods typically compare such operators in a pointwise manner or assume known data alignment.", "Instead, we exploit the Riemannian geometry of SPD matrices to compare these operators and define a new theoretically-motivated distance based on a lower bound of the log-Euclidean metric.", "Our framework facilitates comparison of data manifolds expressed in datasets with different sizes, numbers of features, and measurement modalities.", "Our log-Euclidean signature (LES) distance recovers meaningful structural differences, outperforming competing methods in various application domains."]} {"id": "http://arxiv.org/abs/2202.01694", "title": "Variational Nearest Neighbor Gaussian Processes.", "authors": "Luhuan Wu, Geoff Pleiss, John Cunningham", "abstract": "Variational approximations to Gaussian processes (GPs) typically use a small set of inducing points to form a low-rank approximation to the covariance matrix. In this work, we instead exploit a sparse approximation of the precision matrix. We propose variational nearest neighbor Gaussian process (VNNGP), which introduces a prior that only retains correlations within K nearest-neighboring observations, thereby inducing sparse precision structure. Using the variational framework, VNNGP's objective can be factorized over both observations and inducing points, enabling stochastic optimization with a time complexity of $O(K^3)$. Hence, we can arbitrarily scale the number of inducing points, even to the point of placing an inducing point at every observed location.
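The K-nearest-neighbor precision structure that VNNGP exploits can be illustrated with a Vecchia-style factorization p(f) ~ prod_i p(f_i | f_{N(i)}), which yields a sparse precision (I - B)^T D^{-1} (I - B). A generic sketch of that sparsity pattern, not the paper's variational objective; the kernel and the data ordering are assumptions:

```python
import numpy as np

def nn_precision(X, kernel, K):
    """Vecchia-style sparse GP precision from K nearest previously-ordered
    neighbors (a generic illustration of the sparsity VNNGP exploits)."""
    n = len(X)
    Kxx = kernel(X, X)
    B = np.zeros((n, n)); d = np.zeros(n)
    d[0] = Kxx[0, 0]
    for i in range(1, n):
        nb = np.argsort(np.linalg.norm(X[:i] - X[i], axis=1))[:K]
        w = np.linalg.solve(Kxx[np.ix_(nb, nb)], Kxx[nb, i])
        B[i, nb] = w
        d[i] = Kxx[i, i] - Kxx[nb, i] @ w      # conditional variance
    I_B = np.eye(n) - B
    return I_B.T @ np.diag(1.0 / d) @ I_B

rbf = lambda A, C: np.exp(-0.5 * ((A[:, None, :] - C[None, :, :]) ** 2).sum(-1))
X = np.random.default_rng(0).uniform(size=(40, 1))
Q = nn_precision(X, rbf, K=5)                  # approaches inv(rbf(X, X)) as K grows
```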
We compare VNNGP to other scalable GPs through various experiments, and demonstrate that VNNGP (1) can dramatically outperform low-rank methods, and (2) is less prone to overfitting than other nearest neighbor methods.", "sentences": ["Variational Nearest Neighbor Gaussian Processes.", "Variational approximations to Gaussian processes (GPs) typically use a small set of inducing points to form a low-rank approximation to the covariance matrix.", "In this work, we instead exploit a sparse approximation of the precision matrix.", "We propose variational nearest neighbor Gaussian process (VNNGP), which introduces a prior that only retains correlations within K nearest-neighboring observations, thereby inducing sparse precision structure.", "Using the variational framework, VNNGP's objective can be factorized over both observations and inducing points, enabling stochastic optimization with a time complexity of $O(K^3)$.", "Hence, we can arbitrarily scale the number of inducing points, even to the point of placing an inducing point at every observed location.", "We compare VNNGP to other scalable GPs through various experiments, and demonstrate that VNNGP (1) can dramatically outperform low-rank methods, and (2) is less prone to overfitting than other nearest neighbor methods."]} {"id": "http://arxiv.org/abs/2202.01748", "title": "Sequential Learning of the Topological Ordering for the Linear Non-Gaussian Acyclic Model with Parametric Noise.", "authors": "Gabriel Ruiz, Oscar Hernan Madrid Padilla, Qing Zhou", "abstract": "Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify \"what causes what?\" Contingent on assumptions, it is sometimes possible to identify an exact causal Directed Acyclic Graph (DAG), as opposed to a Markov equivalence class of graphs that gives ambiguity of causal directions. The focus of this paper is on one such case: a linear structural equation model with non-Gaussian noise, a model known as the Linear Non-Gaussian Acyclic Model (LiNGAM). Given a specified parametric noise model, we develop a novel sequential approach to estimate the causal ordering of a DAG. At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering. Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying causal DAG. We provide extensive numerical evidence to demonstrate that our sequential procedure is scalable to cases with possibly thousands of nodes and works well for high-dimensional data.
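The sequential loop described above can be sketched as: regress each remaining variable on the already-ordered ones, score the residuals under the assumed noise model, and append the best-scoring variable. The illustration below substitutes a plain residual log-likelihood for the paper's likelihood-ratio scores, so it shows the structure of the procedure rather than the authors' estimator:

```python
import numpy as np

def sequential_order(X, noise_loglik):
    """Greedy ordering sketch: append the variable whose standardized
    regression residuals score highest under the assumed noise model.
    (The paper's actual rule uses likelihood *ratio* scores.)"""
    n, p = X.shape
    order, remaining = [], list(range(p))
    while remaining:
        Z = X[:, order]
        scores = {}
        for j in remaining:
            if order:
                beta, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
                r = X[:, j] - Z @ beta
            else:
                r = X[:, j] - X[:, j].mean()
            scores[j] = noise_loglik(r / r.std())
        best = max(scores, key=scores.get)
        order.append(best); remaining.remove(best)
    return order

laplace = lambda r: -np.abs(r).sum()           # Laplace log-density up to constants
rng = np.random.default_rng(0)
x0 = rng.laplace(size=2000); x1 = 0.8 * x0 + rng.laplace(size=2000)  # truth: 0 before 1
print(sequential_order(np.c_[x0, x1], laplace))  # expected: [0, 1]
```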
We also apply our estimation procedure to a single-cell gene expression dataset as a demonstration.", "sentences": ["Sequential Learning of the Topological Ordering for the Linear Non-Gaussian Acyclic Model with Parametric Noise.", "Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify \"what causes what?\"", "Contingent on assumptions, it is sometimes possible to identify an exact causal Directed Acyclic Graph (DAG), as opposed to a Markov equivalence class of graphs that gives ambiguity of causal directions.", "The focus of this paper is on one such case: a linear structural equation model with non-Gaussian noise, a model known as the Linear Non-Gaussian Acyclic Model (LiNGAM).", "Given a specified parametric noise model, we develop a novel sequential approach to estimate the causal ordering of a DAG.", "At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering.", "Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying causal DAG.", "We provide extensive numerical evidence to demonstrate that our sequential procedure is scalable to cases with possibly thousands of nodes and works well for high-dimensional data.", "We also apply our estimation procedure to a single-cell gene expression dataset as a demonstration."]} {"id": "http://arxiv.org/abs/2202.01752", "title": "Near-Optimal Learning of Extensive-Form Games with Imperfect Information.", "authors": "Yu Bai, Chi Jin, Song Mei, Tiancheng Yu", "abstract": "This paper resolves the open question of designing near-optimal algorithms for learning imperfect-information extensive-form games from bandit feedback. We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games, where $X,Y$ are the number of information sets and $A,B$ are the number of actions for the two players. This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$, and matches the information-theoretic lower bound up to logarithmic factors. We achieve this sample complexity by two new algorithms: Balanced Online Mirror Descent, and Balanced Counterfactual Regret Minimization. Both algorithms rely on novel approaches of integrating \emph{balanced exploration policies} into their classical counterparts.
We also extend our results to learning Coarse Correlated Equilibria in multi-player general-sum games.", "sentences": ["Near-Optimal Learning of Extensive-Form Games with Imperfect Information.", "This paper resolves the open question of designing near-optimal algorithms for learning imperfect-information extensive-form games from bandit feedback.", "We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games, where $X,Y$ are the number of information sets and $A,B$ are the number of actions for the two players.", "This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$, and matches the information-theoretic lower bound up to logarithmic factors.", "We achieve this sample complexity by two new algorithms: Balanced Online Mirror Descent, and Balanced Counterfactual Regret Minimization.", "Both algorithms rely on novel approaches of integrating \emph{balanced exploration policies} into their classical counterparts.", "We also extend our results to learning Coarse Correlated Equilibria in multi-player general-sum games."]} {"id": "http://arxiv.org/abs/2202.01773", "title": "Multiclass learning with margin: exponential rates with no bias-variance trade-off.", "authors": "Stefano Vigogna, Giacomo Meanti, Ernesto De Vito, Lorenzo Rosasco", "abstract": "We study the behavior of error bounds for multiclass classification under suitable margin conditions. For a wide variety of methods we prove that the classification error under a hard-margin condition decreases exponentially fast without any bias-variance trade-off. Different convergence rates can be obtained under different margin assumptions. With a self-contained and instructive analysis we are able to generalize known results from the binary to the multiclass setting.", "sentences": ["Multiclass learning with margin: exponential rates with no bias-variance trade-off.", "We study the behavior of error bounds for multiclass classification under suitable margin conditions.", "For a wide variety of methods we prove that the classification error under a hard-margin condition decreases exponentially fast without any bias-variance trade-off.", "Different convergence rates can be obtained under different margin assumptions.", "With a self-contained and instructive analysis we are able to generalize known results from the binary to the multiclass setting."]} {"id": "http://arxiv.org/abs/1909.01132", "title": "PageRank algorithm for Directed Hypergraph.", "authors": "Loc Tran, Tho Quan, An Mai", "abstract": "During the last two decades, the World Wide Web's link structure has commonly been modeled as a directed graph. In this paper, we model the World Wide Web's link structure as a directed hypergraph. Moreover, we develop the PageRank algorithm for this directed hypergraph. Due to the lack of World Wide Web directed hypergraph datasets, we apply the PageRank algorithm to a metabolic network, which is itself a directed hypergraph.
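PageRank on a directed hypergraph can be realized as a power iteration over a node-to-node transition matrix induced by a tail-to-hyperedge-to-head random walk. The abstract does not spell out the paper's exact transition construction, so the walk model below is an assumption that shows the shape of the computation:

```python
import numpy as np

def hypergraph_pagerank(hyperedges, n_nodes, damping=0.85, tol=1e-10, max_iter=200):
    """PageRank via a node -> hyperedge -> node random walk on a directed
    hypergraph: a walker at node u picks uniformly among hyperedges whose
    tail contains u, then moves uniformly to a node in that edge's head.
    (One plausible transition model; the paper's construction may differ.)"""
    P = np.zeros((n_nodes, n_nodes))
    out_edges = [[] for _ in range(n_nodes)]
    for tails, heads in hyperedges:
        for u in tails:
            out_edges[u].append(heads)
    for u in range(n_nodes):
        if not out_edges[u]:
            P[u, :] = 1.0 / n_nodes            # dangling node: teleport uniformly
            continue
        for heads in out_edges[u]:
            for v in heads:
                P[u, v] += 1.0 / (len(out_edges[u]) * len(heads))
    r = np.full(n_nodes, 1.0 / n_nodes)
    for _ in range(max_iter):
        r_new = (1 - damping) / n_nodes + damping * (r @ P)
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# Toy directed hypergraph: ({0,1} -> {2}) and ({2} -> {0}).
print(hypergraph_pagerank([([0, 1], [2]), ([2], [0])], n_nodes=3))
```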
The experiments show that our novel PageRank algorithm can be successfully applied to this metabolic network.", "sentences": ["PageRank algorithm for Directed Hypergraph.", "During the last two decades, the World Wide Web's link structure has commonly been modeled as a directed graph.", "In this paper, we model the World Wide Web's link structure as a directed hypergraph.", "Moreover, we develop the PageRank algorithm for this directed hypergraph.", "Due to the lack of World Wide Web directed hypergraph datasets, we apply the PageRank algorithm to a metabolic network, which is itself a directed hypergraph.", "The experiments show that our novel PageRank algorithm can be successfully applied to this metabolic network."]} {"id": "http://arxiv.org/abs/2002.01800", "title": "Sharpe Ratio Analysis in High Dimensions: Residual-Based Nodewise Regression in Factor Models.", "authors": "Mehmet Caner, Marcelo Medeiros, Gabriel Vasconcelos", "abstract": "We provide a new theory for nodewise regression when the residuals from a fitted factor model are used. We apply our results to the analysis of the consistency of Sharpe ratio estimators when there are many assets in a portfolio. We allow for an increasing number of assets as well as time observations of the portfolio. Since the nodewise regression is not feasible due to the unknown nature of idiosyncratic errors, we provide a feasible-residual-based nodewise regression to estimate the precision matrix of errors which is consistent even when the number of assets, p, exceeds the time span of the portfolio, n. In another new development, we also show that the precision matrix of returns can be estimated consistently, even with an increasing number of factors and p>n. We show that: (1) with p>n, the Sharpe ratio estimators are consistent in global minimum-variance and mean-variance portfolios; and (2) with p>n, the maximum Sharpe ratio estimator is consistent when the portfolio weights sum to one; and (3) with p<n.", "sentences": ["Sharpe Ratio Analysis in High Dimensions: Residual-Based Nodewise Regression in Factor Models.", "We provide a new theory for nodewise regression when the residuals from a fitted factor model are used.", "We apply our results to the analysis of the consistency of Sharpe ratio estimators when there are many assets in a portfolio.", "We allow for an increasing number of assets as well as time observations of the portfolio.", "Since the nodewise regression is not feasible due to the unknown nature of idiosyncratic errors, we provide a feasible-residual-based nodewise regression to estimate the precision matrix of errors which is consistent even when the number of assets, p, exceeds the time span of the portfolio, n.", "In another new development, we also show that the precision matrix of returns can be estimated consistently, even with an increasing number of factors and p>n.", "We show that: (1) with p>n, the Sharpe ratio estimators are consistent in global minimum-variance and mean-variance portfolios; and (2) with p>n, the maximum Sharpe ratio estimator is consistent when the portfolio weights sum to one; and (3) with p<