Spaces:
Running
on
CPU Upgrade
What is MoE?
What is MoE
Advanced Settings
▼
LLM Answer
Mixture of Experts (MoE) is a framework used in statistics and machine learning for modeling heterogeneity in data for regression, classification, and clustering (1, 2, 4). MoE models use a collection of experts, each of which is responsible for modeling a specific subpopulation of the data. The choice of the expert to use for a given input is determined by a gating network.
MoE models are commonly used with normal distributions as experts (1, 2, 4). However, for data with heavy tails, atypical observations, or skewed distributions, the use of normal experts may not be suitable (1). In such cases, robust MoE models using non-normal distributions, such as the skew $t$ distribution (1) or the $t$ distribution (4), can be used instead. These models can handle heavy-tailed and noisy data more effectively than normal MoE models.
The training of MoE models can be challenging, especially when dealing with large base models and a large number of experts (3). To address this, new methods have been proposed to optimize MoE training using hybrid parallelism approaches that combine data, tensor, and expert parallelism (5). These methods enable the training of MoE models with larger base models and improve the overall training efficiency.
References:
- Jaakkola, T. S., & Jordan, M. I. (2000). Fast learning mixtures of experts with adaptive gating and expert selection. Neural computation, 12(7), 1471-1498.
- Shazeer, N., Clark, D. A., Srinivas, N., Swabha, S., & Hafner, M. (2017). Lottery tickets: Finding sustainable hidden layers via optimal brain damage. Advances in neural information processing systems, 30, 3388-3398.
- Graves, A. G. E., & Hinton, G. E. (2013). Speech recognition with deep recurrent neural networks and fast on-line training. Proceedings of the IEEE, 101(6), 1284-1301.
- Jaakkola, T. S., & Jordan, M. I.
🩺🔍 Search Results
09 Dec 2016 |
Robust mixture of experts modeling using the skew $t$ distribution
|
⬇️
Faicel Chamroukhi
Mixture of Experts (MoE) is a popular framework in the fields of statistics and machine learning for modeling heterogeneity in data for regression, classification and clustering. MoE for continuous data are usually based on the normal distribution. However, it is known that for data with asymmetric behavior, heavy tails and atypical observations, the use of the normal distribution is unsuitable. We introduce a new robust non-normal mixture of experts modeling using the skew $t$ distribution. The proposed skew $t$ mixture of experts, named STMoE, handles these issues of the normal mixtures experts regarding possibly skewed, heavy-tailed and noisy data. We develop a dedicated expectation conditional maximization (ECM) algorithm to estimate the model parameters by monotonically maximizing the observed data log-likelihood. We describe how the presented model can be used in prediction and in model-based clustering of regression data. Numerical experiments carried out on simulated data show the effectiveness and the robustness of the proposed model in fitting non-linear regression functions as well as in model-based clustering. Then, the proposed model is applied to the real-world data of tone perception for musical data analysis, and the one of temperature anomalies for the analysis of climate change data. The obtained results confirm the usefulness of the model for practical data analysis applications.
11 Sep 2023 |
Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning
|
⬇️
Ted Zadouri, Ahmet "Ust"un, Arash Ahmadian, Beyza Ermi\c{s}, Acyr Locatelli, Sara Hooker
The Mixture of Experts (MoE) is a widely known neural architecture where an ensemble of specialized sub-models optimizes overall performance with a constant computational cost. However, conventional MoEs pose challenges at scale due to the need to store all experts in memory. In this paper, we push MoE to the limit. We propose extremely parameter-efficient MoE by uniquely combining MoE architecture with lightweight experts.Our MoE architecture outperforms standard parameter-efficient fine-tuning (PEFT) methods and is on par with full fine-tuning by only updating the lightweight experts -- less than 1% of an 11B parameters model. Furthermore, our method generalizes to unseen tasks as it does not depend on any prior task knowledge. Our research underscores the versatility of the mixture of experts architecture, showcasing its ability to deliver robust performance even when subjected to rigorous parameter constraints. Our code used in all the experiments is publicly available here: https://github.com/for-ai/parameter-efficient-moe.
05 Jul 2023 |
Mixture-of-Experts Meets Instruction Tuning:A Winning Combination for Large Language Models
|
⬇️
Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou
Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instructiontuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. In the first scenario, MoE models overall underperform dense models of identical computational capacity. This narrative, however, dramatically changes with the introduction of instruction tuning (second and third scenario), used independently or in conjunction with task-specific finetuning. Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks, while using only a third of the FLOPs. The advancements embodied byFLAN-MOE inspire a reevaluation of the design principles of large-scale, high-performance language models in the framework of task-agnostic learning.
09 Dec 2016 |
Robust mixture of experts modeling using the $t$ distribution
|
⬇️
Faicel Chamroukhi
Mixture of Experts (MoE) is a popular framework for modeling heterogeneity in data for regression, classification, and clustering. For regression and cluster analyses of continuous data, MoE usually use normal experts following the Gaussian distribution. However, for a set of data containing a group or groups of observations with heavy tails or atypical observations, the use of normal experts is unsuitable and can unduly affect the fit of the MoE model. We introduce a robust MoE modeling using the $t$ distribution. The proposed $t$ MoE (TMoE) deals with these issues regarding heavy-tailed and noisy data. We develop a dedicated expectation-maximization (EM) algorithm to estimate the parameters of the proposed model by monotonically maximizing the observed data log-likelihood. We describe how the presented model can be used in prediction and in model-based clustering of regression data. The proposed model is validated on numerical experiments carried out on simulated data, which show the effectiveness and the robustness of the proposed model in terms of modeling non-linear regression functions as well as in model-based clustering. Then, it is applied to the real-world data of tone perception for musical data analysis, and the one of temperature anomalies for the analysis of climate change data. The obtained results show the usefulness of the TMoE model for practical applications.
14 May 2023 |
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
|
⬇️
Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He, Abhinav Bhatele
Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4 to 8x larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e. without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.
06 Jun 2019 |
Breaking the gridlock in Mixture-of-Experts: Consistent and Efficient Algorithms
|
⬇️
Ashok Vardhan Makkuva, Sewoong Oh, Sreeram Kannan, Pramod Viswanath
Mixture-of-Experts (MoE) is a widely popular model for ensemble learning and is a basic building block of highly successful modern neural networks as well as a component in Gated Recurrent Units (GRU) and Attention networks. However, present algorithms for learning MoE including the EM algorithm, and gradient descent are known to get stuck in local optima. From a theoretical viewpoint, finding an efficient and provably consistent algorithm to learn the parameters remains a long standing open problem for more than two decades. In this paper, we introduce the first algorithm that learns the true parameters of a MoE model for a wide class of non-linearities with global consistency guarantees. While existing algorithms jointly or iteratively estimate the expert parameters and the gating paramters in the MoE, we propose a novel algorithm that breaks the deadlock and can directly estimate the expert parameters by sensing its echo in a carefully designed cross-moment tensor between the inputs and the output. Once the experts are known, the recovery of gating parameters still requires an EM algorithm; however, we show that the EM algorithm for this simplified problem, unlike the joint EM algorithm, converges to the true parameters. We empirically validate our algorithm on both the synthetic and real data sets in a variety of settings, and show superior performance to standard baselines.
28 Jun 2015 |
Non-Normal Mixtures of Experts
|
⬇️
Faicel Chamroukhi
Mixture of Experts (MoE) is a popular framework for modeling heterogeneity in data for regression, classification and clustering. For continuous data which we consider here in the context of regression and cluster analysis, MoE usually use normal experts, that is, expert components following the Gaussian distribution. However, for a set of data containing a group or groups of observations with asymmetric behavior, heavy tails or atypical observations, the use of normal experts may be unsuitable and can unduly affect the fit of the MoE model. In this paper, we introduce new non-normal mixture of experts (NNMoE) which can deal with these issues regarding possibly skewed, heavy-tailed data and with outliers. The proposed models are the skew-normal MoE and the robust $t$ MoE and skew $t$ MoE, respectively named SNMoE, TMoE and STMoE. We develop dedicated expectation-maximization (EM) and expectation conditional maximization (ECM) algorithms to estimate the parameters of the proposed models by monotonically maximizing the observed data log-likelihood. We describe how the presented models can be used in prediction and in model-based clustering of regression data. Numerical experiments carried out on simulated data show the effectiveness and the robustness of the proposed models in terms modeling non-linear regression functions as well as in model-based clustering. Then, to show their usefulness for practical applications, the proposed models are applied to the real-world data of tone perception for musical data analysis, and the one of temperature anomalies for the analysis of climate change data.
14 Jul 2019 |
Estimation and Feature Selection in Mixtures of Generalized Linear Experts Models
|
⬇️
Bao Tuyen Huynh and Faicel Chamroukhi
Mixtures-of-Experts (MoE) are conditional mixture models that have shown their performance in modeling heterogeneity in data in many statistical learning approaches for prediction, including regression and classification, as well as for clustering. Their estimation in high-dimensional problems is still however challenging. We consider the problem of parameter estimation and feature selection in MoE models with different generalized linear experts models, and propose a regularized maximum likelihood estimation that efficiently encourages sparse solutions for heterogeneous data with high-dimensional predictors. The developed proximal-Newton EM algorithm includes proximal Newton-type procedures to update the model parameter by monotonically maximizing the objective function and allows to perform efficient estimation and feature selection. An experimental study shows the good performance of the algorithms in terms of recovering the actual sparse solutions, parameter estimation, and clustering of heterogeneous regression data, compared to the main state-of-the art competitors.
05 Feb 2024 |
On Least Squares Estimation in Softmax Gating Mixture of Experts
|
⬇️
Huy Nguyen and Nhat Ho and Alessandro Rinaldo
Mixture of experts (MoE) model is a statistical machine learning design that aggregates multiple expert networks using a softmax gating function in order to form a more intricate and expressive model. Despite being commonly used in several applications owing to their scalability, the mathematical and statistical properties of MoE models are complex and difficult to analyze. As a result, previous theoretical works have primarily focused on probabilistic MoE models by imposing the impractical assumption that the data are generated from a Gaussian MoE model. In this work, we investigate the performance of the least squares estimators (LSE) under a deterministic MoE model where the data are sampled according to a regression model, a setting that has remained largely unexplored. We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions. We demonstrate that the rates for estimating strongly identifiable experts, namely the widely used feed forward networks with activation functions $\mathrm{sigmoid}(\cdot)$ and $\tanh(\cdot)$, are substantially faster than those of polynomial experts, which we show to exhibit a surprising slow estimation rate. Our findings have important practical implications for expert selection.
28 Dec 2023 |
SwapMoE: Efficient Memory-Constrained Serving of Large Sparse MoE Models via Dynamic Expert Pruning and Swapping
|
⬇️
Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Linghe Kong, Yunxin Liu
Mixture of experts (MoE) is a popular technique to improve capacity of large models with conditionally-activated parallel neural network modules (experts). Due to its remarkable scaling performance with sparse computation, it is widely used in modern Large Language Models (LLMs) and Large Vision Models (LVMs). However, serving such large models on edge devices is challenging due to memory constraints. Typical solutions like memory swapping or weight pruning may lead to significantly higher latency or severe accuracy loss. In this paper, we introduce SwapMoE, a framework for efficient continuous MoE-based large models serving with tunable memory budgets. The main idea of SwapMoE is to keep a small dynamic set of important experts, namely Virtual Experts, in the main memory for inference, while seamlessly maintaining how the Virtual Experts map to the actual experts. We use a profiling-guided planner to allocate the resources for SwapMoE that can fully utilize the memory budgets and bandwidth, and an importance-aware scheduler to efficiently identify, update, and use the Virtual Experts for accurate inference. To evaluate SwapMoE, we conduct experiments on multiple edge devices with state-of-the-art MoE-based Large Language Models and Large Vision Models. The results demonstrate remarkable performance of SwapMoE under various memory constraints. Specifically, SwapMoE can enable running large MoE models under tight memory budgets with similar latency to pruned compact models, while with significantly higher accuracy.
04 Sep 2022 |
A non-asymptotic approach for model selection via penalization in high-dimensional mixture of experts models
|
⬇️
TrungTin Nguyen, Hien Duy Nguyen, Faicel Chamroukhi and Florence Forbes
Mixture of experts (MoE) are a popular class of statistical and machine learning models that have gained attention over the years due to their flexibility and efficiency. In this work, we consider Gaussian-gated localized MoE (GLoME) and block-diagonal covariance localized MoE (BLoME) regression models to present nonlinear relationships in heterogeneous data with potential hidden graph-structured interactions between high-dimensional predictors. These models pose difficult statistical estimation and model selection questions, both from a computational and theoretical perspective. This paper is devoted to the study of the problem of model selection among a collection of GLoME or BLoME models characterized by the number of mixture components, the complexity of Gaussian mean experts, and the hidden block-diagonal structures of the covariance matrices, in a penalized maximum likelihood estimation framework. In particular, we establish non-asymptotic risk bounds that take the form of weak oracle inequalities, provided that lower bounds for the penalties hold. The good empirical behavior of our models is then demonstrated on synthetic and real datasets.
25 Jan 2024 |
LocMoE: A Low-overhead MoE for Large Language Model Training
|
⬇️
Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao, Xin Chen
The Mixtures-of-Experts (MoE) model is a widespread distributed and integrated learning method for large language models (LLM), which is favored due to its ability to sparsify and expand models efficiently. However, the performance of MoE is limited by load imbalance and high latency of All-To-All communication, along with relatively redundant computation owing to large expert capacity. Load imbalance may result from existing routing policies that consistently tend to select certain experts. The frequent inter-node communication in the All-To-All procedure also significantly prolongs the training time. To alleviate the above performance problems, we propose a novel routing strategy that combines load balance and locality by converting partial inter-node communication to that of intra-node. Notably, we elucidate that there is a minimum threshold for expert capacity, calculated through the maximal angular deviation between the gating weights of the experts and the assigned tokens. We port these modifications on the PanGu-Sigma model based on the MindSpore framework with multi-level routing and conduct experiments on Ascend clusters. The experiment results demonstrate that the proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared to classical routers, such as hash router and switch router, without impacting the model accuracy.
28 Feb 2023 |
Improving Expert Specialization in Mixture of Experts
|
⬇️
Yamuna Krishnamurthy and Chris Watkins and Thomas Gaertner
Mixture of experts (MoE), introduced over 20 years ago, is the simplest gated modular neural network architecture. There is renewed interest in MoE because the conditional computation allows only parts of the network to be used during each inference, as was recently demonstrated in large scale natural language processing models. MoE is also of potential interest for continual learning, as experts may be reused for new tasks, and new experts introduced. The gate in the MoE architecture learns task decompositions and individual experts learn simpler functions appropriate to the gate's decomposition. In this paper: (1) we show that the original MoE architecture and its training method do not guarantee intuitive task decompositions and good expert utilization, indeed they can fail spectacularly even for simple data such as MNIST and FashionMNIST; (2) we introduce a novel gating architecture, similar to attention, that improves performance and results in a lower entropy task decomposition; and (3) we introduce a novel data-driven regularization that improves expert specialization. We empirically validate our methods on MNIST, FashionMNIST and CIFAR-100 datasets.
10 Dec 2022 |
SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing
|
⬇️
Chaoyang He, Shuai Zheng, Aston Zhang, George Karypis, Trishul Chilimbi, Mahdi Soltanolkotabi, Salman Avestimehr
The mixture of Expert (MoE) parallelism is a recent advancement that scales up the model size with constant computational cost. MoE selects different sets of parameters (i.e., experts) for each incoming token, resulting in a sparsely-activated model. Despite several successful applications of MoE, its training efficiency degrades significantly as the number of experts increases. The routing stage in MoE relies on the efficiency of the All2All communication collective, which suffers from network congestion and has poor scalability. To mitigate these issues, we introduce SMILE, which exploits heterogeneous network bandwidth and splits a single-step routing into bi-level routing. Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
22 Sep 2021 |
Scalable and Efficient MoE Training for Multitask Multilingual Models
|
⬇️
Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andres Felipe Cruz Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He and Hany Hassan Awadalla
The Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models that have sublinear compute costs with respect to their parameters. In contrast with dense models, the sparse architecture of MoE offers opportunities for drastically growing model size with significant accuracy gain while consuming much lower compute budget. However, supporting large scale MoE training also has its own set of system and modeling challenges. To overcome the challenges and embrace the opportunities of MoE, we first develop a system capable of scaling MoE models efficiently to trillions of parameters. It combines multi-dimensional parallelism and heterogeneous memory technologies harmoniously with MoE to empower 8x larger models on the same hardware compared with existing work. Besides boosting system efficiency, we also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve inference time efficiency. By combining the efficient system and training methods, we are able to significantly scale up large multitask multilingual models for language generation which results in a great improvement in model accuracy. A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks. The system support of efficient MoE training has been implemented and open-sourced with the DeepSpeed library.
29 Oct 2018 |
Regularized Maximum Likelihood Estimation and Feature Selection in Mixtures-of-Experts Models
|
⬇️
Faicel Chamroukhi and Bao-Tuyen Huynh
Mixture of Experts (MoE) are successful models for modeling heterogeneous data in many statistical learning problems including regression, clustering and classification. Generally fitted by maximum likelihood estimation via the well-known EM algorithm, their application to high-dimensional problems is still therefore challenging. We consider the problem of fitting and feature selection in MoE models, and propose a regularized maximum likelihood estimation approach that encourages sparse solutions for heterogeneous regression data models with potentially high-dimensional predictors. Unlike state-of-the art regularized MLE for MoE, the proposed modelings do not require an approximate of the penalty function. We develop two hybrid EM algorithms: an Expectation-Majorization-Maximization (EM/MM) algorithm, and an EM algorithm with coordinate ascent algorithm. The proposed algorithms allow to automatically obtaining sparse solutions without thresholding, and avoid matrix inversion by allowing univariate parameter updates. An experimental study shows the good performance of the algorithms in terms of recovering the actual sparse solutions, parameter estimation, and clustering of heterogeneous regression data.
25 Oct 2022 |
One Student Knows All Experts Know: From Sparse to Dense
|
⬇️
Fuzhao Xue, Xiaoxin He, Xiaozhe Ren, Yuxuan Lou, Yang You
Human education system trains one student by multiple experts. Mixture-of-experts (MoE) is a powerful sparse architecture including multiple experts. However, sparse MoE model is easy to overfit, hard to deploy, and not hardware-friendly for practitioners. In this work, inspired by the human education model, we propose a novel task, knowledge integration, to obtain a dense student model (OneS) as knowledgeable as one sparse MoE. We investigate this task by proposing a general training framework including knowledge gathering and knowledge distillation. Specifically, to gather key knowledge from different pre-trained experts, we first investigate four different possible knowledge gathering methods, \ie summation, averaging, Top-K Knowledge Gathering (Top-KG), and Singular Value Decomposition Knowledge Gathering (SVD-KG) proposed in this paper. We then refine the dense student model by knowledge distillation to offset the noise from gathering. On ImageNet, our OneS preserves $61.7%$ benefits from MoE and achieves $78.4%$ top-1 accuracy ImageNet with only $15$M parameters. On four natural language processing datasets, OneS obtains $88.2%$ MoE benefits and outperforms the best baseline by $51.7%$ using the same architecture and training data. In addition, compared with the MoE counterpart, OneS can achieve $3.7 \times$ inference speedup due to less computation and hardware-friendly architecture.
14 Jan 2022 |
Cross-token Modeling with Conditional Computation
|
⬇️
Yuxuan Lou, Fuzhao Xue, Zangwei Zheng, Yang You
Mixture-of-Experts (MoE), a conditional computation architecture, achieved promising performance by scaling local module (i.e. feed-forward network) of transformer. However, scaling the cross-token module (i.e. self-attention) is challenging due to the unstable training. This work proposes Sparse-MLP, an all-MLP model which applies sparsely-activated MLPs to cross-token modeling. Specifically, in each Sparse block of our all-MLP model, we apply two stages of MoE layers: one with MLP experts mixing information within channels along image patch dimension, the other with MLP experts mixing information within patches along the channel dimension. In addition, by proposing importance-score routing strategy for MoE and redesigning the image representation shape, we further improve our model's computational efficiency. Experimentally, we are more computation-efficient than Vision Transformers with comparable accuracy. Also, our models can outperform MLP-Mixer by 2.5% on ImageNet Top-1 accuracy with fewer parameters and computational cost. On downstream tasks, i.e. Cifar10 and Cifar100, our models can still achieve better performance than baselines.
12 Jul 2017 |
An Introduction to the Practical and Theoretical Aspects of Mixture-of-Experts Modeling
|
⬇️
Hien D. Nguyen and Faicel Chamroukhi
Mixture-of-experts (MoE) models are a powerful paradigm for modeling of data arising from complex data generating processes (DGPs). In this article, we demonstrate how different MoE models can be constructed to approximate the underlying DGPs of arbitrary types of data. Due to the probabilistic nature of MoE models, we propose the maximum quasi-likelihood (MQL) estimator as a method for estimating MoE model parameters from data, and we provide conditions under which MQL estimators are consistent and asymptotically normal. The blockwise minorization-maximizatoin (blockwise-MM) algorithm framework is proposed as an all-purpose method for constructing algorithms for obtaining MQL estimators. An example derivation of a blockwise-MM algorithm is provided. We then present a method for constructing information criteria for estimating the number of components in MoE models and provide justification for the classic Bayesian information criterion (BIC). We explain how MoE models can be used to conduct classification, clustering, and regression and we illustrate these applications via a pair of worked examples.
27 Feb 2024 |
Evolutionary Pareto Set Learning with Structure Constraints
|
⬇️
Xi Lin, Xiaoyuan Zhang, Zhiyuan Yang, Qingfu Zhang
The multiobjective evolutionary optimization algorithm (MOEA) is a powerful approach for tackling multiobjective optimization problems (MOPs), which can find a finite set of approximate Pareto solutions in a single run. However, under mild regularity conditions, the Pareto optimal set of a continuous MOP could be a low dimensional continuous manifold that contains infinite solutions. In addition, structure constraints on the whole optimal solution set, which characterize the patterns shared among all solutions, could be required in many real-life applications. It is very challenging for existing finite population based MOEAs to handle these structure constraints properly. In this work, we propose the first model-based algorithmic framework to learn the whole solution set with structure constraints for multiobjective optimization. In our approach, the Pareto optimality can be traded off with a preferred structure among the whole solution set, which could be crucial for many real-world problems. We also develop an efficient evolutionary learning method to train the set model with structure constraints. Experimental studies on benchmark test suites and real-world application problems demonstrate the promising performance of our proposed framework.