In the complex landscape of multi-task learning, AdaMerging has emerged as a potent method for adaptively merging model parameters to optimize performance across tasks. Unlike traditional fixed-coefficient methods, AdaMerging autonomously learns merging coefficients, offering a more refined and responsive approach1.
+The cornerstone of AdaMerging lies in its adaptive nature, where it learns the coefficients for merging either on a task-wise or layer-wise basis. This adaptability is driven by an entropy minimization strategy applied to unlabeled test samples as a surrogate objective function, which serves to refine the merging coefficients for optimal performance.
Task-wise AdaMerging is formulated as:

\[ \theta = \theta_0 + \sum_{i=1}^{n} \lambda_i \tau_i \]
+where \(\lambda_i\) represents the merging coefficient for the \(i\)-th task, and \(\tau_i\) denotes the task vector for the \(i\)-th task.
On the other hand, Layer-wise AdaMerging is articulated as:

\[ \theta^{l} = \theta_0^{l} + \sum_{i=1}^{n} \lambda^{l}_{i} \tau^{l}_{i} \]
+where the merging coefficient \(\lambda^{l}_{i}\) and task vector \(\tau^{l}_{i}\) are specific to each layer \(l\) of the model.
+By leveraging this adaptive learning approach, AdaMerging significantly enhances the model's ability to generalize across tasks and layers, resulting in a more robust and finely-tuned performance profile. The method’s reliance on entropy minimization ensures that the merging process continually seeks the most informative and stable configuration, adapting to the specific needs of the dataset and tasks at hand.
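To make the surrogate objective concrete, here is a self-contained toy sketch (illustrative only, not the FusionBench implementation; all names are made up for the example) that learns task-wise merging coefficients by minimizing the entropy of predictions on unlabeled data:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tasks, dim, num_classes = 2, 16, 4

# Toy stand-ins for the pretrained weights and one task vector per task
# (in practice these come from the model pool).
pretrained = torch.randn(num_classes, dim)
task_vectors = [0.1 * torch.randn(num_classes, dim) for _ in range(num_tasks)]

# Learnable task-wise merging coefficients.
lambdas = torch.nn.Parameter(torch.full((num_tasks,), 0.3))
optimizer = torch.optim.Adam([lambdas], lr=1e-3)

for step in range(100):
    x = torch.randn(32, dim)  # an unlabeled test batch; labels are never used
    # Merge on the fly: theta = theta_0 + sum_i lambda_i * tau_i
    merged = pretrained + sum(l * tau for l, tau in zip(lambdas, task_vectors))
    probs = F.softmax(x @ merged.t(), dim=-1)
    # Shannon entropy of the predictions is the unsupervised surrogate loss.
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
```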
Task-wise Coefficients. The figure below shows how the merging coefficient of each task vector changes during the optimization process of Task-wise AdaMerging and AdaMerging++, sampled every ten steps. We consistently observe that the merging coefficients of the task vectors differ from one another. When the number of tasks is large, grid searching a coefficient for each task is clearly impractical, but AdaMerging avoids this manual search process.
Layer-wise Coefficients. The following figure shows the merging coefficients learned by Layer-wise AdaMerging and AdaMerging++ on ViT-B/32, respectively. We observe that the learned coefficients differ across both layers and task vectors.
+Merge CLIP-ViT-B/32 models from eight downstream image classification tasks:
```bash
fusion_bench \
    method=adamerging \
    method.name=clip_layer_wise_adamerging \
    method.save_merging_weights=merging_weights.pt \
    modelpool=clip-vit-base-patch32_TA8 \
    taskpool=clip-vit-classification_TA8 \
    fabric_logger.root_dir=outputs/logs/ViT-B-32 \
    fabric_logger.name=clip_layer_wise_adamerging_adam
```
Part of the output:
```
Profiler Report

----------------------------------------------------------------------------------------------------------------------------------
|  Action                       |  Mean duration (s)  |  Num calls  |  Total time (s)  |  Percentage %  |
----------------------------------------------------------------------------------------------------------------------------------
|  Total                        |  -                  |  26001      |  724.65          |  100 %         |
----------------------------------------------------------------------------------------------------------------------------------
|  backward pass                |  0.060172           |  8000       |  481.38          |  66.429        |
|  forward pass                 |  0.016124           |  8000       |  128.99          |  17.801        |
|  data loading                 |  0.0063443          |  8000       |  50.754          |  7.004         |
|  merging weights              |  0.050735           |  1000       |  50.735          |  7.0013        |
|  construct the wrapped model  |  7.2558             |  1          |  7.2558          |  1.0013        |
|  optimizer step               |  0.00098186         |  1000       |  0.98186         |  0.13549       |
----------------------------------------------------------------------------------------------------------------------------------
```
### task_wise_adamerging

#### TaskWiseAdaMergingAlgorithm

Bases: `ModelFusionAlgorithm`

Source code in `fusion_bench/method/adamerging/task_wise_adamerging.py`
##### `compute_logits(module, batch, task)` `abstractmethod`

Compute the logits for the given batch and task.

Parameters:

- `module` (`Module`): The model module.
- `batch` (`tuple`): A batch of input data.
- `task` (`str`): The name of the task.

Returns:

- `Tensor`: The classification logits for the batch.

Source code in `fusion_bench/method/adamerging/task_wise_adamerging.py`
##### `entropy_loss(logits)`

Compute the entropy loss of a set of logits.

Parameters:

- `logits` (`Tensor`): The logits to compute the entropy loss of.

Returns:

- `Tensor`: The entropy loss of the logits.

Source code in `fusion_bench/method/adamerging/task_wise_adamerging.py`
### clip_task_wise_adamerging

#### CLIPTaskWiseAdaMergingAlgorithm

Bases: `TaskWiseAdaMergingAlgorithm`

A class for task-wise adaptive merging of CLIP models.

This class extends the TaskWiseAdaMergingAlgorithm to provide specific functionality for CLIP models, including loading datasets, constructing zero-shot classification heads, and computing logits.

Attributes:

- `modelpool` (`CLIPVisionModelPool`): The model pool containing CLIP models.
- `_clip_processor` (`CLIPProcessor`): The CLIP processor for preparing inputs.
- `zeroshot_weights` (`dict`): A dictionary to store zero-shot weights for each task.

Source code in `fusion_bench/method/adamerging/clip_task_wise_adamerging.py`
##### `compute_logits(module, batch, task)`

Compute the logits for the given batch and task.

This method computes the image embeddings, normalizes them, and calculates the cosine similarity with the text embeddings to produce classification logits.

Parameters:

- `module` (`Module`): The model module.
- `batch` (`tuple`): A batch of input data.
- `task` (`str`): The name of the task.

Returns:

- `Tensor`: The classification logits for the batch.

Source code in `fusion_bench/method/adamerging/clip_task_wise_adamerging.py`
##### `get_shuffled_test_loader_iter(task)` `cached`

Get an iterator over the shuffled test DataLoader for the task.

Parameters:

- `task` (`str`): The name of the task.

Returns:

- `iterator`: An iterator over the shuffled test DataLoader.

Source code in `fusion_bench/method/adamerging/clip_task_wise_adamerging.py`
##### `get_test_dataset(task)` `cached`

Load the test dataset for the task. This method is cached, so the dataset is loaded only once.

Parameters:

- `task` (`str`): The name of the task.

Returns:

- `CLIPDataset`: The test dataset for the task.

Source code in `fusion_bench/method/adamerging/clip_task_wise_adamerging.py`
##### `on_test_time_adaptation_start()`

Prepare for test-time adaptation.

This method loads the CLIP processor and constructs the zero-shot classification head for each task.

Source code in `fusion_bench/method/adamerging/clip_task_wise_adamerging.py`
#### InfiniteDataLoader

A wrapper class for DataLoader to create an infinite data loader. This is useful when we are only interested in the number of steps rather than the number of epochs.

This class wraps a DataLoader and provides an iterator that resets when the end of the dataset is reached, creating an infinite loop.

Attributes:

- `data_loader` (`DataLoader`): The DataLoader to wrap.
- `data_iter` (`iterator`): An iterator over the DataLoader.

Source code in `fusion_bench/method/adamerging/clip_task_wise_adamerging.py`
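Its behavior can be summarized by a minimal sketch along these lines (illustrative, not the exact source):

```python
from torch.utils.data import DataLoader

class InfiniteDataLoader:
    """Wrap a DataLoader so iteration never raises StopIteration."""

    def __init__(self, data_loader: DataLoader):
        self.data_loader = data_loader
        self.data_iter = iter(data_loader)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self.data_iter)
        except StopIteration:
            # Restart (and, for a shuffling loader, re-shuffle) when exhausted.
            self.data_iter = iter(self.data_loader)
            return next(self.data_iter)
```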
### layer_wise_adamerging

#### LayerWiseAdaMergingAlgorithm

Bases: `ModelFusionAlgorithm`, `LightningFabricMixin`, `SimpleProfilerMixin`

Implements the Layer-Wise AdaMerging Algorithm.

This class merges the layers of a pretrained model with those of several fine-tuned models. The merging is controlled by layer-wise weights, which can be initialized based on a provided configuration or loaded from a file.

Source code in `fusion_bench/method/adamerging/layer_wise_adamerging.py`
##### `__init__(algorithm_config)`

Initialize the LayerWiseAdaMergingAlgorithm with the given configuration.

Parameters:

- `algorithm_config` (`DictConfig`): The configuration for the algorithm.

Source code in `fusion_bench/method/adamerging/layer_wise_adamerging.py`
##### `compute_logits(module, images, task)` `abstractmethod`

Compute the logits for the given images and task.

Parameters:

- `module`: The model module.
- `images` (`Tensor`): The input images.
- `task` (`str`): The name of the task.

Returns:

- `Tensor`: The computed logits.

Source code in `fusion_bench/method/adamerging/layer_wise_adamerging.py`
##### `construct_layer_wise_merged_model(modelpool)`

Constructs a wrapped layer-wise merged model from the model pool.

This method creates a new wrapped model by merging the layers of a pretrained model with those of several fine-tuned models. The merging is controlled by layer-wise weights, which is a `torch.Tensor` of shape `(num_models, num_layers)`. The merging weights can be initialized based on a provided configuration or loaded from a file.

Parameters:

- `modelpool` (`ModelPool`): An object containing the pretrained model and fine-tuned models to be merged.

Returns:

- `LayerWiseMergedModel`: An instance of the merged model with layer-wise weights applied.

Source code in `fusion_bench/method/adamerging/layer_wise_adamerging.py`
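Functionally, the merge performed by the wrapped model corresponds to the following sketch (a simplification over lists of per-layer parameter tensors; the names are illustrative, and the actual class operates on wrapped `nn.Module` layers):

```python
import torch

def merge_layer_wise(pretrained_layers, task_vector_layers, weights):
    """Merge per-layer parameters: theta^l = theta_0^l + sum_i w[i, l] * tau_i^l.

    pretrained_layers: list (length num_layers) of tensors theta_0^l.
    task_vector_layers: list (num_models) of lists (num_layers) of tensors tau_i^l.
    weights: tensor of shape (num_models, num_layers).
    """
    num_models, num_layers = weights.shape
    merged = []
    for l in range(num_layers):
        theta_l = pretrained_layers[l].clone()
        for i in range(num_models):
            theta_l += weights[i, l] * task_vector_layers[i][l]
        merged.append(theta_l)
    return merged

# Example: 3 fine-tuned models, 4 "layers" of toy parameters.
pre = [torch.zeros(2, 2) for _ in range(4)]
taus = [[torch.ones(2, 2) for _ in range(4)] for _ in range(3)]
w = torch.full((3, 4), 0.3)
merged = merge_layer_wise(pre, taus, w)  # each entry equals 0.9 everywhere
```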
##### `get_shuffled_test_loader_iter(task)` `abstractmethod`

Loader of the test dataset for test-time adaptation; labels are not needed.

Parameters:

- `task` (`str`): The name of the task.

Returns:

- `DataLoader`: The data loader for the test dataset.

Source code in `fusion_bench/method/adamerging/layer_wise_adamerging.py`
##### `on_test_time_adaptation_start()`

Things to do before the test-time adaptation starts, such as setting up the task-specific heads.
##### `run(modelpool)`

Run the Layer-Wise AdaMerging Algorithm.

This method constructs the wrapped model and performs test-time adaptation if necessary.

Parameters:

- `modelpool` (`ModelPool`): The model pool containing the pretrained and fine-tuned models.

Returns:

- `LayerWiseMergedModel`: The merged model after test-time adaptation.

Source code in `fusion_bench/method/adamerging/layer_wise_adamerging.py`
##### `save_merging_weights(file_path, merging_weights)`

Save the merging weights to a file.

Parameters:

- `file_path` (`str`): The path to save the merging weights.
- `merging_weights` (`Tensor`): The merging weights to save.

Source code in `fusion_bench/method/adamerging/layer_wise_adamerging.py`
##### `test_time_adaptation(module)`

Perform test-time adaptation on the merged model.

This method adapts the merging weights during test-time to improve performance.

Parameters:

- `module` (`LayerWiseMergedModel`): The merged model.

Returns:

- `LayerWiseMergedModel`: The adapted merged model.

Source code in `fusion_bench/method/adamerging/layer_wise_adamerging.py`
### clip_layer_wise_adamerging

Example Usage:

```bash
fusion_bench \
    method=adamerging \
    method.name=clip_layer_wise_adamerging \
    method.save_merging_weights=merging_weights.pt \
    modelpool=clip-vit-base-patch32_TA8 \
    taskpool=clip-vit-classification_TA8 \
    fabric_logger.root_dir=outputs/logs/ViT-B-32 \
    fabric_logger.name=clip_layer_wise_adamerging_adam
```

#### CLIPLayerWiseAdaMergingAlgorithm

Bases: `CLIPClassificationMixin`, `LayerWiseAdaMergingAlgorithm`

Source code in `fusion_bench/method/adamerging/clip_layer_wise_adamerging.py`
##### `on_test_time_adaptation_start()`

Here we load the CLIP processor and construct the zero-shot classification head for each task.
1. (ICLR 2024) AdaMerging: Adaptive Model Merging for Multi-Task Learning. https://openreview.net/pdf?id=nZP6NgD3QY ↩
2. Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems, 27, 2014. ↩
3. A. Tang, L. Shen, Y. Luo, N. Yin, L. Zhang, and D. Tao, "Merging Multi-Task Models via Weight-Ensembling Mixture of Experts," ICML 2024. doi: 10.48550/arXiv.2402.00433. ↩
+Consider a discrete categorical distribution parameterized by logits \(\mathbf{x} = (x_1, \dots, x_n) \in \mathbb{R}^{n}\), where \(x_i\) is the logit of the \(i\)-th category. The Gumbel-Max trick 123 states a reparameterization trick to sample from the categorical distribution by sampling from the standard Gumbel distribution \(\text{Gumbel}(\mu=0,\beta=1)\) and taking the argmax of the sum of the Gumbel random variables and the logits.
This trick proceeds as follows: sample \(n\) Gumbel random variables \(g_1, \dots, g_n\) independently from the standard Gumbel distribution \(\text{Gumbel}(\mu=0,\beta=1)\) (we can draw a random sample \(u\) from a uniform distribution on the interval \((0,1)\) and then transform it into a Gumbel-distributed variable \(g\) using the formula \(g=-\log(-\log u)\)), and find the index \(i\) that maximizes \(x_i + g_i\); then we have

\[ P(i = k) = \frac{\exp(x_k)}{\sum_{j=1}^{n} \exp(x_j)}, \quad k = 1, \dots, n. \]
If we represent the categorical distribution as a one-hot vector \(\mathbf{y} = (y_1, \dots, y_n) \in \{0,1\}^n\), where \(y_i=1\) indicates that the \(i\)-th category is sampled and \(y_j=0\) for all \(j\neq i\), then we have

\[ \mathbf{y} = \text{one-hot}\left( \arg\max_{i} (x_i + g_i) \right). \]
Since the derivative of the \({\arg\max}\) function is not defined, we cannot backpropagate gradients through it. To address this issue, Maddison et al. (2017)4 proposed a continuous relaxation of the discrete categorical distribution. A Concrete random variable (CONtinuous relaxation of disCRETE random variable) relaxes the condition that the one-hot vector \(\mathbf{y}\) must be located at the vertices of the \((n-1)\)-dimensional simplex \(\Delta^{n-1}\); instead, it allows \(\mathbf{y}\) to be located anywhere inside the simplex \(\Delta^{n-1}\), i.e. \(\{ \mathbf{y} \in \mathbb{R}^n \mid y_i \in [0,1], \sum_{i=1}^n y_i = 1 \}\).
To sample a Concrete random variable \(\mathbf{y}\) from a distribution parameterized by a temperature hyperparameter \(\lambda > 0\) and a vector of logits \(\mathbf{x} = (x_1, \dots, x_n) \in \mathbb{R}^{n}\), we have

\[ y_i = \frac{\exp\left((x_i + g_i)/\lambda\right)}{\sum_{j=1}^{n} \exp\left((x_j + g_j)/\lambda\right)}, \quad i = 1, \dots, n, \]
+where \(\mathbf{g} = (g_1, \dots, g_n)\) is a vector of Gumbel random variables that are independently sampled from the standard Gumbel distribution \(\text{Gumbel}(\mu=0,\beta=1)\).
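A minimal PyTorch sketch of both the hard Gumbel-Max sample and its Concrete relaxation (illustrative, not the library's code):

```python
import torch
import torch.nn.functional as F

def sample_gumbel(shape):
    # g = -log(-log u), with u ~ Uniform(0, 1)
    u = torch.rand(shape)
    return -torch.log(-torch.log(u))

logits = torch.tensor([1.0, 2.0, 0.5])

# Gumbel-Max: a hard one-hot sample from Categorical(softmax(logits)).
g = sample_gumbel(logits.shape)
hard_sample = F.one_hot(torch.argmax(logits + g), num_classes=logits.numel())

# Concrete relaxation: a soft sample inside the simplex; lowering the
# temperature pushes the sample toward a vertex (a one-hot vector).
temperature = 0.5
soft_sample = F.softmax((logits + sample_gumbel(logits.shape)) / temperature, dim=-1)
```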
+A subspace mask \(\mathbf{m}\) is a binary vector that identifies a subspace of the parameter space. +For a neural network parametrized by \(\theta\), we can use a subspace mask \(\mathbf{m}\) to identify a subspace of the parameter space \(\mathbf{\theta}\) by setting the parameters that are not in the subspace to zero, i.e. \(\mathbf{\theta} \circ \mathbf{m}\), where \(\circ\) denotes the element-wise product. +We can draw a random sample \(\mathbf{m}\) from a Bernoulli distribution \(\text{Bernoulli}(\mathbf{p}=\sigma(\mathbf{x}))\), where \(\mathbf{p}\) is the probability (\(\mathbf{x}\) denotes the logits) of each parameter being activated. However, the discrete Bernoulli distribution is not differentiable, so we cannot backpropagate the gradients through it to optimize the parameters \(\mathbf{p}\) or \(\mathbf{x}\).
+To address this issue, we introduce the Concrete mask which can be drawn from a continuous relaxation of Bernoulli distribution. Before we introduce the Concrete mask, we first review the Gumbel-Max trick in the two-class case.
Let \(p_0\) and \(p_1\) denote the unnormalized probabilities of a Bernoulli random variable being 0 and 1, respectively, with \(x = \log p_1 - \log p_0\) representing the logit. Then, the probability of the event \(m=1\) is given by

\[ P(m=1) = \frac{p_1}{p_0 + p_1} = \sigma(x), \]
where \(\sigma\) denotes the sigmoid function. In the context of the Gumbel-Max trick, the occurrence of the event \(m=1\) is determined by the condition \(g_1 + \log p_1 > g_0 + \log p_0\), where \(g_0\) and \(g_1\) are two independent standard Gumbel random variables. Thus we have

\[ P(m=1) = P\left(g_1 + \log p_1 > g_0 + \log p_0\right) = P\left(g_1 - g_0 > \log p_0 - \log p_1\right). \label{eq:appendix_P_m_1} \]
Because the difference of two standard Gumbel random variables is a Logistic random variable, we can replace \(g_1 - g_0\) with \(\log u - \log(1-u)\), where \(u\) is a random variable sampled from a uniform distribution on the interval \((0,1)\). Substituting this into Eq.(\ref{eq:appendix_P_m_1}) and expressing the probability in terms of the logits \(x\) to simplify the expression, we have

\[ m = \begin{cases} 1, & \text{if } \log u - \log(1-u) + x > 0, \\ 0, & \text{otherwise}. \end{cases} \]
The binary Concrete distribution offers a continuous relaxation of discrete Bernoulli random variables, which is beneficial for gradient-based optimization as it allows for the backpropagation of gradients even through the sampling process. Instead of making a hard decision as in the above equation, we use a temperature parameter \(\lambda\) to control the steepness of the sigmoid function, and hence control how close our 'soft' decisions are to being 'hard' decisions. The continuous version of the Bernoulli random variable is then given by

\[ \hat{m} = \sigma\left( \frac{x + \log u - \log(1-u)}{\lambda} \right). \]
+As the temperature \(\lambda\) approaches zero, the sigmoid function becomes a step function, and the Concrete random variable \(\hat{m}\) becomes a Bernoulli random variable, as shown in the following Figure. In the limit when \(\lambda \to 0\), this results in sampling \(m=1\) if \(\log \frac{\sigma(x)}{1 - \sigma(x)} > -\log \frac{u}{1 - u}\), consistent with the original Gumbel-Max trick. +The binary Concrete distribution thus provides a differentiable approximation to Bernoulli random variables. +We can further binarize the Concrete mask by setting the entries with values greater than 0.5 to 1 and the rest to 0.
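In the binary case this reduces to a few lines; the sketch below (illustrative, not the library's implementation) draws a relaxed Bernoulli mask from per-parameter logits and optionally binarizes it as described above:

```python
import torch

def concrete_mask(logits: torch.Tensor, temperature: float, hard: bool = False):
    """Draw a relaxed Bernoulli (binary Concrete) mask from per-parameter logits."""
    u = torch.rand_like(logits)
    # hat_m = sigmoid((x + log u - log(1 - u)) / lambda)
    mask = torch.sigmoid((logits + torch.log(u) - torch.log(1 - u)) / temperature)
    if hard:
        # Binarize entries above 0.5 while keeping gradients of the soft mask
        # (straight-through estimator).
        mask = (mask > 0.5).float() + mask - mask.detach()
    return mask

x = torch.zeros(10)  # logits; activation probability sigma(0) = 0.5 per entry
m = concrete_mask(x, temperature=0.5)
```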
+ +Merging CLIP models on eight image classification tasks, using the concrete task arithmetic algorithm
```bash
# tensorboard logs and learned checkpoints of the shared mask can be found at
# https://huggingface.co/tanganke/clip-vit-base-patch32_concrete-task-arithmetic_tblogs
fusion_bench \
    fabric_logger.name=ViT-B-32/concrete_task_arithmetic \
    method=concrete_subspace/clip_concrete_task_arithmetic \
    modelpool=CLIPVisionModelPool/clip-vit-base-patch32_TA8 \
    taskpool=CLIPVisionModelTaskPool/clip-vit-classification_TA8
```
results
```json
{
    "svhn": {
        "accuracy": 0.903003990650177,
        "loss": 0.37700024247169495
    },
    "stanford_cars": {
        "accuracy": 0.6326327323913574,
        "loss": 1.2553859949111938
    },
    "resisc45": {
        "accuracy": 0.7558730244636536,
        "loss": 1.017554759979248
    },
    "eurosat": {
        "accuracy": 0.9407407641410828,
        "loss": 0.20871955156326294
    },
    "gtsrb": {
        "accuracy": 0.8285035490989685,
        "loss": 0.5861473679542542
    },
    "mnist": {
        "accuracy": 0.9800000190734863,
        "loss": 0.08148527890443802
    },
    "dtd": {
        "accuracy": 0.5249999761581421,
        "loss": 2.2731478214263916
    },
    "sun397": {
        "accuracy": 0.6421158909797668,
        "loss": 1.4108904600143433
    }
}
```
Concrete AdaMerging (Layer-wise)
```bash
# tensorboard logs and learned checkpoints of the shared mask can be found at
# https://huggingface.co/tanganke/clip-vit-base-patch32_concrete-layer-wise_adamerging_tblogs
fusion_bench \
    fabric_logger.name=ViT-B-32/clip_concrete_layer_wise_adamerging \
    method=concrete_subspace/clip_concrete_layer_wise_adamerging \
    modelpool=CLIPVisionModelPool/clip-vit-base-patch32_TA8 \
    taskpool=CLIPVisionModelTaskPool/clip-vit-classification_TA8
```
X. Yi, S. Zheng, L. Wang, X. Wang, and L. He, "A safety realignment framework via subspace-oriented model fusion for large language models." arXiv, May 14, 2024. doi: 10.48550/arXiv.2405.09055.

> The paper introduces a safety realignment framework for large language models via subspace-oriented model fusion (SOMF; the authors learn a shared mask on the weight space of a large language model), which combines the safeguard capabilities of initially aligned models with fine-tuned models to ensure safety without compromising performance on downstream tasks.
1. E. J. Gumbel. Statistical Theory of Extreme Values and Some Practical Applications. A Series of Lectures. Technical Report PB175818, National Bureau of Standards, Washington, D. C. Applied Mathematics Div., 1954. URL https://ntrl.ntis.gov/NTRL/dashboard/searchResults/titleDetail/PB175818.xhtml. ↩
2. R. Duncan Luce. Individual Choice Behavior. John Wiley, Oxford, England, 1959. ↩
3. Chris J. Maddison, Daniel Tarlow, and Tom Minka. A* Sampling. Advances in Neural Information Processing Systems, 27, 2014. ↩
4. Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, March 2017. URL http://arxiv.org/abs/1611.00712. ↩
The `DepthUpscalingAlgorithm` is used to upscale the depth of PyTorch models. Here's a basic guide on how to use it:
First, import the necessary modules:

```python
from omegaconf import DictConfig
from torch import nn
from fusion_bench.method.depth_upscaling import DepthUpscalingAlgorithm
from fusion_bench.modelpool import to_modelpool
```
Create an instance of `DepthUpscalingAlgorithm` by passing a configuration dictionary. This dictionary should contain the name of the method ("depth_upscaling") and a list of layer indices that determine the upscaling pattern.
```python
method_config = {"name": "depth_upscaling", "layer_indices": [0, 1, 1, 0]}
algorithm = DepthUpscalingAlgorithm(DictConfig(method_config))
```
Assume we have a list of PyTorch models (an `nn.ModuleList` instance) that we want to upscale. Here, we're creating a list of linear models as an example:
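For instance (a minimal stand-in; the layer sizes are arbitrary):

```python
model = nn.ModuleList([nn.Linear(10, 10), nn.Linear(10, 10)])
```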
Then, we can pass the model to the `run` method of our algorithm:
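For example, continuing the sketch above:

```python
upscaled_model = algorithm.run(model)
print(len(upscaled_model))  # 4, following the pattern [0, 1, 1, 0]
```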
The `run` method will return an upscaled model. The type of the returned model will be the same as the input models (in this case, `nn.ModuleList`), and its length will be determined by the layer indices specified in the method configuration.
Here we provide an example of how to use the `DepthUpscalingAlgorithm` to upscale the depth of a Mistral model 1.
```python
from omegaconf import DictConfig
from torch import nn
from transformers import AutoModelForCausalLM, MistralConfig, MistralForCausalLM
from fusion_bench.method.depth_upscaling import DepthUpscalingAlgorithm

# create a Mistral model
# here we randomly initialize the model for demonstration purposes
# in practice, you would load a pretrained model
model_config = MistralConfig(
    # https://huggingface.co/mistralai/Mistral-7B-v0.1/resolve/main/config.json
    **{
        "architectures": ["MistralForCausalLM"],
        "bos_token_id": 1,
        "eos_token_id": 2,
        "hidden_act": "silu",
        "hidden_size": 4096,
        "initializer_range": 0.02,
        "intermediate_size": 14336,
        "max_position_embeddings": 32768,
        "model_type": "mistral",
        "num_attention_heads": 32,
        "num_hidden_layers": 32,
        "num_key_value_heads": 8,
        "rms_norm_eps": 1e-05,
        "rope_theta": 10000.0,
        "sliding_window": 4096,
        "tie_word_embeddings": False,
        "torch_dtype": "bfloat16",
        "transformers_version": "4.34.0.dev0",
        "use_cache": True,
        "vocab_size": 32000,
    }
)
print('creating model')
model: MistralForCausalLM = AutoModelForCausalLM.from_config(model_config)

method_config = {
    "name": "depth_upscaling",
    "layer_indices": ["range(0,24)", "range(8,32)"],
}
algorithm = DepthUpscalingAlgorithm(DictConfig(method_config))
print('upscaling model')
upscaled_model = algorithm.run(model.model.layers)

# substitute the model with the upscaled model
model.model.layers = upscaled_model
```
The `DepthUpscalingAlgorithm` is integrated into the `fusion_bench` package. You can use it by specifying `"depth_upscaling"` as the method name in the command line or configuration file.
```yaml
name: depth_upscaling
# this should be a list of integers or strings, indicating the sequence of layers. If the entry is an integer, it will use the n-th layer of the model. If the entry is a string, it will use the layers specified by the string. The string should be a valid python expression that evaluates to a list of integers.
# for example, ["range(0,12)", "range(6,12)"] will use the first 12 layers and the last 6 layers of the model to construct the new model
# [0, 2, 4, "range(6,12)"] will use the 1st, 3rd, 5th, and the 7th to 12th layers of the model to construct the new model
layer_indices: null
```
You can then run the `fusion_bench` command with the specified configuration file:
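For example, a command along the following lines (the model pool and task pool here are placeholders):

```bash
fusion_bench \
    method=depth_upscaling \
    method.layer_indices="[0, 1, 1, 0]" \
    modelpool=... \
    taskpool=...
```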
#### DepthUpscalingAlgorithm

Bases: `BaseAlgorithm`

Implements the Depth Upscaling Algorithm.

This class extends the `BaseModelFusionAlgorithm` to handle depth upscaling of models. It supports upscaling the depth of a model by duplicating specified layers.

Parameters:

- `layer_indices` (`list`): List of layer indices to duplicate.
- `**kwargs`: Additional keyword arguments.

Source code in `fusion_bench/method/depth_upscaling/depth_upscaling.py`
##### `run(modelpool)`

Executes the depth upscaling algorithm on a given model pool.

This method checks the type of the model pool, ensures that it contains only one model, and verifies that the model is an instance of `nn.ModuleList`.

Parameters:

- `modelpool` (`ModuleList | ModelPool`): The pool of models to upscale. Must contain only one model.

Returns:

- `nn.ModuleList`: The upscaled model.

Raises:

- `AssertionError`: If the model pool contains more than one model or if the model is not an instance of `nn.ModuleList`.
- `ValueError`: If an invalid layer specification is provided in the configuration.

Source code in `fusion_bench/method/depth_upscaling/depth_upscaling.py`
The Dummy Algorithm is a simple algorithm that does not perform any fusion operation. Instead, it returns a pretrained model if one is available in the model pool. If no pretrained model is available, it returns the first model in the model pool. This algorithm is useful for testing and debugging purposes, as it allows you to quickly check if the model pool is set up correctly and the fusion process is working as expected.
To use the Dummy Algorithm, you need to specify `"dummy"` as the algorithm name.
The implementation of the Dummy Algorithm is straightforward. Here is the main method of the `DummyAlgorithm` class:
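Conceptually, it amounts to the following sketch (an illustrative pseudo-implementation, not the exact source; `has_pretrained`, `load_model`, and `model_names` are assumed model-pool helpers):

```python
def run(self, modelpool):
    # Return the pretrained model if the pool provides one,
    # otherwise fall back to the first model in the pool.
    if modelpool.has_pretrained:  # assumed helper
        return modelpool.load_model("_pretrained_")
    return modelpool.load_model(modelpool.model_names[0])
```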
#### DummyAlgorithm

Bases: `BaseAlgorithm`

Source code in `fusion_bench/method/dummy.py`
The Fisher merging algorithm 1 is a per-parameter weighted averaging method that assigns weights to the models based on the Fisher information matrix of the models on some labeled data. The Fisher information matrix \(F_\theta\) of a model with parameters \(\theta\) can be expressed as:

\[ F_\theta = \mathbb{E}_{x \sim p(x)} \left[ \mathbb{E}_{y \sim p(y|x,\theta)} \left[ \nabla_\theta \log p(y|x,\theta) \, \nabla_\theta \log p(y|x,\theta)^\top \right] \right], \]
+where \(p(x)\) is the data distribution, \(p(y|x, \theta)\) is the model's output distribution, for example, the softmax output of a classification model, and \(\nabla_\theta\) is the gradient with respect to the model's parameters \(\theta\). +The Fisher information matrix can be used to estimate the importance of each parameter in the model and thus assign weights to the models based on their Fisher information. +In addition, the Fisher information matrix can be used to estimate the similarity between tasks, which can be useful in auxiliary-task learning and multi-task learning scenarios 2.
As the full Fisher information matrix is often computationally expensive to compute and memory-intensive to store, we approximate it using the diagonal Fisher information matrix, which is the diagonal of the full Fisher information matrix. The diagonal Fisher information matrix can be computed as:

\[ \hat{F}_\theta = \mathbb{E}_{x \sim p(x)} \left[ \mathbb{E}_{y \sim p(y|x,\theta)} \left[ \left( \nabla_\theta \log p(y|x,\theta) \right)^2 \right] \right]. \]
Assuming we have \(n\) models with parameters \(\theta_i\) and diagonal Fisher information matrices \(\hat{F}_{\theta_i}\), the Fisher merging algorithm computes the merged model's parameters \(\theta\) as follows:

\[ \theta_j = \frac{\sum_{i=1}^{n} \hat{F}_{\theta_i, j} \, \theta_{i, j}}{\sum_{i=1}^{n} \hat{F}_{\theta_i, j}}, \]
where \(\theta_i\) are the parameters of the individual models, \(\hat{F}_{\theta_i}\) are the diagonal Fisher information matrices of the individual models, and \(j\) indexes the parameters of the models. The Fisher merging algorithm can be considered a per-weight weighted averaging method, where the weights are determined by the Fisher information of each parameter in the models.
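To make the procedure concrete, here is a self-contained toy sketch (not the library implementation) that estimates diagonal Fisher matrices for two linear classifiers and merges their weights parameter-wise:

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model: torch.nn.Linear, inputs: torch.Tensor) -> torch.Tensor:
    """Estimate the diagonal Fisher of a linear classifier's weight matrix."""
    fisher = torch.zeros_like(model.weight)
    for x in inputs:
        logits = model(x)
        # Sample y from the model's own output distribution p(y | x, theta).
        y = torch.distributions.Categorical(logits=logits).sample()
        model.zero_grad()
        F.cross_entropy(logits.unsqueeze(0), y.unsqueeze(0)).backward()
        fisher += model.weight.grad.detach() ** 2  # squared score function
    return fisher / len(inputs)

models = [torch.nn.Linear(8, 3) for _ in range(2)]
data = torch.randn(64, 8)
fishers = [diagonal_fisher(m, data) for m in models]

# Per-parameter weighted average, with a small epsilon as the minimal value
# to avoid division by zero:
eps = 1e-8
numerator = sum(f * m.weight.detach() for f, m in zip(fishers, models))
merged_weight = numerator / (sum(fishers) + eps)
```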
Example of merging eight CLIP-ViT-B/32 models using Fisher merging:

```bash
fusion_bench method=clip_fisher_merging \
    modelpool=clip-vit-base-patch32_TA8 \
    taskpool=clip-vit-classification_TA8
```
Merge eight CLIP-ViT-L/14 models using Fisher merging:

```bash
fusion_bench \
    method=clip_fisher_merging \
    method.batch_size=8 method.num_workers=4 \
    modelpool=clip-vit-large-patch14_TA8 \
    taskpool=clip-vit-classification_TA8 \
    taskpool.clip_model=openai/clip-vit-large-patch14
```
Merge GPT-2 models for text classification tasks:

```bash
fusion_bench \
    method=gpt2_fisher_merging \
    method.num_fisher_examples=512 method.batch_size=8 \
    modelpool=gpt-2_glue \
    taskpool=gpt-2_glue
```
#### FisherMergingAlgorithm

Bases: `BaseAlgorithm`

Implements the Fisher Merging Algorithm.

This class extends the BaseModelFusionAlgorithm to handle merging of models using Fisher weights. It supports excluding certain parameters, normalizing Fisher weights, and setting a minimal value for Fisher weights.

Methods:

- `run(modelpool: BaseModelPool) -> nn.Module`: Executes the Fisher merging process on the model pool and returns the merged model.

Source code in `fusion_bench/method/fisher_merging/fisher_merging.py`
##### `get_fisher_weights(model_name, model, train_dataset, param_names_to_merge)`

Compute the Fisher weights for the given model and training dataset.

Parameters:

- `model_name` (`str`): The name of the model.
- `model` (`Module`): The model module.
- `train_dataset`: The training dataset.
- `param_names_to_merge` (`List[str]`): List of parameter names to merge.

Returns:

- `Dict[str, Tensor]`: The computed Fisher weights for each parameter.

Source code in `fusion_bench/method/fisher_merging/fisher_merging.py`
##### `on_fisher_merging_start()`

Set up the zero-shot classification head before starting the Fisher merging process.

Source code in `fusion_bench/method/fisher_merging/fisher_merging.py`
##### `run(modelpool)`

Run the Fisher Merging Algorithm.

This method constructs the wrapped model and performs test-time adaptation if necessary.

Parameters:

- `modelpool` (`BaseModelPool`): The model pool containing the pretrained and fine-tuned models.

Returns:

- `nn.Module`: The merged model after test-time adaptation.

Source code in `fusion_bench/method/fisher_merging/fisher_merging.py`
The `Fusion Algorithm` module is a core component of the FusionBench project, dedicated to the implementation and execution of various model fusion techniques. This module provides the mechanisms necessary to combine multiple models from the Model Pool, enabling nuanced and optimized model merging operations.
### `Fusion Algorithm` Module

The module is typically invoked through a configuration-driven approach in CLI scripts, enabling users to specify fusion algorithms and parameters via YAML configuration files. This method ensures reproducibility and ease of use. For more information, see the documentation of the fusion_bench CLI.
`ModelFusionAlgorithm` is the base class for all fusion algorithms in the Fusion Algorithm module. It provides a common interface for different fusion techniques, allowing for seamless integration and execution of various algorithms.
Implement your own model fusion algorithm:

```python
from fusion_bench.method import BaseModelFusionAlgorithm
from fusion_bench.modelpool import BaseModelPool

class DerivedModelFusionAlgorithm(BaseModelFusionAlgorithm):
    """
    An example of a derived model fusion algorithm.
    """

    # _config_mapping maps each attribute to the corresponding key in the configuration file.
    _config_mapping = BaseModelFusionAlgorithm._config_mapping | {
        "hyperparam_attr_1": "hyperparam_1",
        "hyperparam_attr_2": "hyperparam_2",
    }

    def __init__(self, hyperparam_1, hyperparam_2, **kwargs):
        self.hyperparam_attr_1 = hyperparam_1
        self.hyperparam_attr_2 = hyperparam_2
        super().__init__(**kwargs)

    def run(self, modelpool: BaseModelPool):
        # implement the fusion algorithm here
        raise NotImplementedError(
            "DerivedModelFusionAlgorithm.run() is not implemented."
        )
```
We provide a simple example to illustrate how the algorithm is used in FusionBench as follows:

```python
import logging
from typing import Dict, Optional
from omegaconf import DictConfig

from fusion_bench.utils import instantiate

log = logging.getLogger(__name__)

def run_model_fusion(
    method_config: DictConfig,
    modelpool_config: DictConfig,
    taskpool_config: Optional[DictConfig] = None,
    seed: Optional[int] = None,
    print_config: bool = True,
    **kwargs
):
    """
    Run the model fusion process.

    Args:
        method_config: Configuration for the fusion method.
        modelpool_config: Configuration for the model pool.
        taskpool_config: Configuration for the task pool (optional).
    """
    # Instantiate components: modelpool, method, and taskpool
    modelpool = instantiate(modelpool_config)
    method = instantiate(method_config)
    taskpool = None
    if taskpool_config is not None:
        taskpool = instantiate(taskpool_config)

    # Run fusion
    merged_model = method.run(modelpool)

    # Evaluate if taskpool is provided
    if taskpool is not None:
        report = taskpool.evaluate(merged_model)
```
In summary, the Fusion Algorithm module is vital for the model merging operations within FusionBench, leveraging sophisticated techniques to ensure optimal fusion and performance evaluation of deep learning models. This capability makes it an indispensable tool for researchers and practitioners focusing on model fusion strategies.
+
#### BaseAlgorithm

Bases: `BaseYAMLSerializableModel`

Base class for model fusion algorithms.

This class provides a template for implementing model fusion algorithms. Subclasses must implement the `run` method to define the fusion logic.

Source code in `fusion_bench/method/base_algorithm.py`
##### `run(modelpool)` `abstractmethod`

Fuse the models in the given model pool.

This method must be implemented by subclasses to define the fusion logic.

Examples:

```python
>>> algorithm = SimpleAverageAlgorithm()
>>> modelpool = ModelPool()
>>> merged_model = algorithm.run(modelpool)
```

Parameters:

- `modelpool` (`BaseModelPool`): The pool of models to fuse.

Source code in `fusion_bench/method/base_algorithm.py`
#### BaseModelFusionAlgorithm = BaseAlgorithm `module-attribute`

Alias for `BaseAlgorithm`.
The max-model predictor algorithm is a type of ensemble method. +Formally, a max-model predictor is defined as follows:
Definition (Max-Model Predictor) 1: Given a set of predictors \(H = \{h_1, h_2, \ldots, h_n\}\), with \(h_i: \mathcal{X} \times \mathcal{Y}_i \mapsto \mathbb{R}\), the max-model predictor \(h_H\) is defined as:

\[ h_H(x, y) = \max_{i:\, y \in \mathcal{Y}_i} h_i(x, y). \]
Take the flu detection problem as an example 1. Doctors want to build a learning model to detect what type of virus a patient is affected by based on her symptoms, for appropriate treatment. However, the types of influenza vary geographically (Rejmanek et al., 2015), which means the distribution of patient records collected by a hospital in California may be different from those in Florida. In an extreme case, some types are unknown to the other hospital. Assume there are 4 types of influenza in the United States. In California, 2 of the 4 are commonly detected, while in Florida 3 of the 4 types are often detected. We assume that in the two states, doctors separately trained two models \(h_{CA}\) and \(h_{FL}\) which work well locally in California and Florida respectively. However, a direct ensemble of the two local models may not work well on all the patients. Let \(h_{US}\) denote the ideal global model trained on the combination of the local datasets. When we input a patient record \(x\), each model outputs its prediction as shown in the following table:
Table: Example of flu detection on a patient \(x\) affected with type 2 flu. "−" means this model is not able to predict the corresponding class. Taking the maximal score as the prediction, \(h_{FL}\) is consistent with \(h_{US}\), but the combination of the two local models \(h_{\{CA,FL\}}\) is not, since \(3/4 > 4/7\).

| Type | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| \(h_{US}(x)\) | 2/10 | 4/10 | 1/10 | 3/10 |
| \(h_{CA}(x)\) | − | − | 1/4 | 3/4 |
| \(h_{FL}(x)\) | 2/7 | 4/7 | 1/7 | − |
| \(h_{\{CA,FL\}}(x)\) | 2/7 | 4/7 | 1/4 | 3/4 |
Here is an example of how to use the Max-Model Predictor Algorithm:

```python
from fusion_bench.method import MaxModelPredictorAlgorithm
from fusion_bench.modelpool import ModelPool

# Instantiate the MaxPredictorAlgorithm
algorithm = MaxModelPredictorAlgorithm()

# Assume we have a ModelPool instance that contains the models we want to ensemble.
modelpool = ModelPool(...)  # or a list of nn.Module

# Run the algorithm on the model pool.
max_model_predictor: nn.Module = algorithm.run(modelpool)
```
Configuration template for the Max Predictor Algorithm:
To create a max predictor ensemble of models for a specific task, you can use the following command:
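For instance, assuming the method configuration is named `max_model_predictor` (the pools below are placeholders):

```bash
fusion_bench \
    method=max_model_predictor \
    modelpool=... \
    taskpool=...
```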
`ModelRecombinationAlgorithm` is a class used to recombine models in a model pool. Here's how to use it.

First, import the necessary modules:

```python
from fusion_bench.method import ModelRecombinationAlgorithm
from fusion_bench.modelpool import ModelPool, to_modelpool
from torch import nn
```
Create an instance of `ModelRecombinationAlgorithm`:
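For example (assuming the default constructor):

```python
model_recombination = ModelRecombinationAlgorithm()
```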
Create a model pool using the `to_modelpool` function. This function takes a list of models or a dict of models and converts it into a `ModelPool`:
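For example, with a few toy models:

```python
modelpool = to_modelpool([nn.Linear(10, 10) for _ in range(3)])
```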
Use the `run` method of the `ModelRecombinationAlgorithm` instance to recombine the models in the model pool:
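For example:

```python
new_modelpool = model_recombination.run(modelpool, return_modelpool=True)
```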
The `run` method takes two arguments:

- `modelpool`: The model pool to recombine.
- `return_modelpool` (optional): A boolean indicating whether to return the entire model pool or just the first model. Defaults to `True`.

If `return_modelpool` is `True`, the `run` method returns a new `ModelPool` with the recombined models. If `False`, it returns the first model from the new model pool.

You can check the type of the returned value to ensure that the `run` method worked correctly:
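For example:

```python
assert isinstance(new_modelpool, ModelPool)
```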
Configuration template for the model recombination algorithm:
```yaml
name: model_recombination
# if `return_model_pool` is not null, the argument `return_modelpool` passed to the `run` method will be ignored.
return_modelpool: null
```
Construct a model recombination using our CLI tool `fusion_bench`:

```bash
fusion_bench \
    method=model_recombination \
    method.return_modelpool=false \
    modelpool=... \
    taskpool=...
```
#### ModelRecombinationAlgorithm

Bases: `BaseAlgorithm`

Model recombination recombines the layers of the given models to create a new set of models.

Source code in `fusion_bench/method/model_recombination.py`
##### `run(modelpool, return_modelpool=True)`

Executes the model recombination algorithm on a given model pool.

This method loads models from the model pool, determines their type, and applies the appropriate recombination method. It then creates a new model pool with the recombined models. Depending on the `return_modelpool` flag, it either returns the entire new model pool or just the first model from it.

- If the models are of type `nn.ModuleList`, the recombination method `recombine_modellist` is used, where each module in the list is shuffled across the models.
- If the models are of type `nn.ModuleDict`, the recombination method `recombine_modeldict` is used, where each module in the dictionary is shuffled across the models.
- If the models are of type `nn.Module`, the recombination method `recombine_state_dict` is used, where the state dictionaries of the models are shuffled across the models.

Parameters:

- `modelpool` (`BaseModelPool`): The pool of models to recombine.
- `return_modelpool` (`bool`, default: `True`): Flag indicating whether to return the entire model pool or just the first model. Defaults to `True`. If this algorithm is initialized with a config, the value of `return_modelpool` in the config will be used and this argument passed to the method will be ignored.

Returns:

- `Union[nn.Module, BaseModelPool]`: The recombined model pool or the first model from the recombined pool, depending on the `return_modelpool` flag.

Raises:

- `ValueError`: If the models in the model pool are of an unsupported type.

Source code in `fusion_bench/method/model_recombination.py`
##### `recombine_modellist(models)`

Source code in `fusion_bench/method/model_recombination.py`

##### `recombine_modeldict(models)`

Source code in `fusion_bench/method/model_recombination.py`

##### `recombine_state_dict(models)`

Source code in `fusion_bench/method/model_recombination.py`
Here we provide instructions on how to use the `fusion_bench` command-line interface to merge models using a Mixture of Experts (MoE) approach.
The first code block is a YAML configuration file for the merging method. The `name` field specifies the name of the merging method. The `num_experts` field specifies the number of experts to use in the merging process. The `experts_per_token` field specifies the number of experts to use per token. The `save_checkpoint` field specifies the path where the merged model will be saved.
```yaml
name: mixtral_for_causal_lm_moe_merging

experts_per_token: 2
# path to save the merged model, if provided
save_checkpoint: null
```
The second code block is another YAML configuration file, this time for the model pool. The `type` field specifies the type of model pool to use. The `models` field is a list of models to include in the pool. Each model should have a `name` and a `path`, and the model is loaded from the path.
```yaml
type: AutoModelForCausalLMPool
# each model should have a name and a path, and the model is loaded from the path
# this is equivalent to `AutoModelForCausalLM.from_pretrained(path)`
models:
  - name: _pretrained_
    path: path_to_your_pretrained_model
  - name: expert_1
    path: path_to_your_expert_model_1
  - name: expert_2
    path: path_to_your_expert_model_2
  - name: expert_3
    path: path_to_your_expert_model_3
  - name: expert_4
    path: path_to_your_expert_model_4
```
Finally, the third code block is a bash command that runs the `fusion_bench` command-line interface with the specified method, model pool, and task pool. The `method` argument specifies the merging method to use. The `modelpool` argument specifies the model pool to use. The `modelpool.models.0.path` argument specifies the path to the pretrained model to use. The `taskpool` argument specifies the task pool to use. In this case, a dummy task pool is used that does nothing but print the parameter counts of the merged model.
```bash
fusion_bench \
    method=mixtral_moe_merging \
    modelpool=mixtral_moe_merging \
    taskpool=dummy # this is a dummy taskpool that does nothing but print the parameter counts of the merged model
```
This guide provides a step-by-step process for merging models using the `fusion_bench` command-line interface. By following these instructions, you can merge your own models and save them for future use.
### mixtral_merging

#### MixtralForCausalLMMergingAlgorithm

Bases: `MixtralForCausalLMUpscalingAlgorithm`

This class is responsible for merging models into a `MixtralForCausalLM`.

Source code in `fusion_bench/method/mixture_of_experts/mixtral_merging.py`
##### `run(modelpool)`

Runs the merging process. It first upscales the models to MixtralForCausalLM, then substitutes the experts of the MixtralForCausalLM with the models from the modelpool.

Parameters:

- `modelpool` (`ModelPool`): The pool of models to be merged. Each model in the pool will be treated as an expert, and should be a `MistralForCausalLM` or `LlamaForCausalLM`.

Returns:

- `MixtralForCausalLM`: The merged model.

Source code in `fusion_bench/method/mixture_of_experts/mixtral_merging.py`
#### MixtralMoEMergingAlgorithm

Bases: `MixtralUpscalingAlgorithm`

This class is responsible for merging models into a MixtralModel.

Source code in `fusion_bench/method/mixture_of_experts/mixtral_merging.py`
##### `run(modelpool)`

Runs the merging process.

Parameters:

- `modelpool` (`ModelPool`): The pool of models to be merged. Each model in the pool will be treated as an expert, and should be a `MistralModel` or `LlamaModel`.

Returns:

- `MixtralModel`: The merged model.

Source code in `fusion_bench/method/mixture_of_experts/mixtral_merging.py`
Sparse upcycling is a technique used to initialize a sparsely activated Mixture-of-Experts (MoE) model from a dense checkpoint. This approach leverages previously incurred training costs to improve the performance of large models while reducing the computational expense. In the process, dense Transformer blocks are partially replaced with MoE blocks, where the MLPs in a Transformer block are replaced by multiple experts. The experts are chosen based on routing probabilities determined by a router. The initialized MoE model is then further trained to recover the performance. This method results in improved performance for both language and vision models while using only a fraction of the original dense pretraining cost 1.
+Here’s an example demonstrating how to upscale a pre-trained Mistral model to a Mixtral model:
```python
import os

from omegaconf import DictConfig
from transformers import MistralForCausalLM

from fusion_bench.method.mixture_of_experts.mixtral_upcycling import (
    MixtralForCausalLMUpscalingAlgorithm,
)
from fusion_bench.utils import print_parameters

# Load a pre-trained Mistral model
pretrained_model = MistralForCausalLM.from_pretrained(
    os.path.expanduser("path_to_mistral_model")
)
print("Pretrained model:")
print_parameters(pretrained_model)
# Output:
# Pretrained model:
# trainable params: 7.24B || all params: 7.24B || trainable%: 100.0000

# Define the configuration for Mixtral
config = {
    "num_experts": 4,  # Number of expert channels
    "experts_per_token": 2,  # Experts to choose per token
}

# Initialize the upscaling algorithm
upscaling_for_causal_lm_algorithm = MixtralForCausalLMUpscalingAlgorithm(
    DictConfig(config)
)

# Run the upscaling process to get a Mixtral model
mixtral_for_causal_lm_model = upscaling_for_causal_lm_algorithm.run(pretrained_model)

print("Mixtral model:")
print_parameters(mixtral_for_causal_lm_model)
# Outputs:
# Mixtral model:
# trainable params: 24.15B || all params: 24.15B || trainable%: 100.0000

# Save the upscaled Mixtral model
mixtral_for_causal_lm_model.save_pretrained("path_to_save_mixtral_model")
```
A Jupyter notebook example is also available at our repo.
This is a guide on how to use the `fusion_bench` command-line interface to upscale a Mistral model to a Mixtral model.
The first code block is a YAML configuration file for the upscaling method. The `name` field specifies the name of the upscaling method. The `num_experts` field specifies the number of experts to use in the upscaling process. The `experts_per_token` field specifies the number of experts to use per token. The `save_checkpoint` field specifies the path where the upscaled model will be saved, if provided.
```yaml
name: mixtral_for_causal_lm_moe_upscaling # or "mixtral_moe_upscaling"

num_experts: 4
experts_per_token: 2
# path to save the upscaled model
save_checkpoint: null
```
The second code block is another YAML configuration file, this time for the model pool. The `type` field specifies the type of model pool to use. The `models` field is a list of models to include in the pool. Each model should have a `name` and a `path`, and the model is loaded from the `path`.
```yaml
type: AutoModelForCausalLMPool
# each model should have a name and a path, and the model is loaded from the path
# this is equivalent to `AutoModelForCausalLM.from_pretrained(path)`
models:
  - name: _pretrained_
    path: path_to_your_pretrained_model
```
Finally, the third code block is a bash command that runs the `fusion_bench` command-line interface with the specified method, model pool, and task pool. The `method` argument specifies the upscaling method to use. The `modelpool` argument specifies the model pool to use. The `modelpool.models.0.path` argument specifies the path to the pretrained model to use. The `taskpool` argument specifies the task pool to use. In this case, a dummy task pool is used that does nothing but print the parameter counts of the merged model.

```bash
fusion_bench \
    method=mixtral_moe_upscaling \
    modelpool=mixtral_moe_upscaling \
    modelpool.models.0.path=path_to_your_pretrained_model \
    taskpool=dummy # this is a dummy taskpool that does nothing but print the parameter counts of the merged model
```
### mixtral_upcycling

#### MixtralForCausalLMUpscalingAlgorithm

Bases: `BaseAlgorithm`

This class is responsible for upscaling a model to a MixtralForCausalLM. It inherits from the ModelFusionAlgorithm class.

Source code in `fusion_bench/method/mixture_of_experts/mixtral_upcycling.py`
##### `__init__(num_experts, experts_per_token, save_checkpoint, **kwargs)`

Initialize the MixtralForCausalLMUpscalingAlgorithm.

Parameters:

- `num_experts` (`int`): The number of experts in the Mixtral model.
- `experts_per_token` (`int`): The number of experts per token.
- `save_checkpoint` (`str`): The path to save the checkpoint.
- `**kwargs`: Additional keyword arguments.

Source code in `fusion_bench/method/mixture_of_experts/mixtral_upcycling.py`
##### `run(modelpool)`

Runs the upscaling process.

Parameters:

- `modelpool` (`ModelPool | LlamaForCausalLM | MistralForCausalLM`): The model to be upscaled.

Returns:

- `MixtralForCausalLM`: The upscaled model.

Source code in `fusion_bench/method/mixture_of_experts/mixtral_upcycling.py`
#### MixtralUpscalingAlgorithm

Bases: `BaseAlgorithm`

This class is responsible for upscaling a model to a MixtralModel. It inherits from the ModelFusionAlgorithm class.

Source code in `fusion_bench/method/mixture_of_experts/mixtral_upcycling.py`
##### `__init__(num_experts, experts_per_token, save_checkpoint, **kwargs)`

Initialize the MixtralUpscalingAlgorithm.

Parameters:

- `num_experts` (`int`): The number of experts in the Mixtral model.
- `experts_per_token` (`int`): The number of experts per token.
- `save_checkpoint` (`str`): The path to save the checkpoint.
- `**kwargs`: Additional keyword arguments.

Source code in `fusion_bench/method/mixture_of_experts/mixtral_upcycling.py`
##### `run(modelpool)`

Runs the upscaling process.

Parameters:

- `modelpool` (`ModelPool | LlamaModel | MistralModel`): The model to be upscaled.

Returns:

- `MixtralModel`: The upscaled model.

Source code in `fusion_bench/method/mixture_of_experts/mixtral_upcycling.py`
##### `upscale_to_mixtral_for_causal_lm(input_model, output_model)`

A helper function. Upscales a LlamaForCausalLM or MistralForCausalLM to a MixtralForCausalLM.

Parameters:

- `input_model` (`LlamaForCausalLM | MistralForCausalLM`): The input model to be upscaled.
- `output_model` (`MixtralForCausalLM`): The output model where the upscaled weights will be loaded.

Returns:

- None

Source code in `fusion_bench/method/mixture_of_experts/mixtral_upcycling.py`
##### `upscale_to_mixtral_model(input_model, output_model)`

A helper function. Upscales a LlamaModel or MistralModel to a MixtralModel.

Parameters:

- `input_model` (`LlamaModel | MistralModel`): The input model to be upscaled.
- `output_model` (`MixtralModel`): The output model where the upscaled weights will be loaded.

Returns:

- None

Source code in `fusion_bench/method/mixture_of_experts/mixtral_upcycling.py`
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints. http://arxiv.org/abs/2212.05055 ↩
The following command prunes a Llama model with a sparsity ratio of 0.7 (70% of the weights are pruned) using unstructured magnitude pruning. The pruned model is saved to `outputs/llama/magnitude_pruning/unstructured/0.7`.

```bash
fusion_bench \
    --config-name llama_magnitude_pruning \
    method.prune_type=unstructured \
    method.sparsity_ratio=0.7 \
    modelpool.models.0.path=decapoda-research/llama-7b-hf \
    merged_model_save_path=outputs/llama/magnitude_pruning/unstructured/0.7
```
The following command prunes a Llama model with a 2:4 semi-structured pruning pattern using magnitude pruning. The pruned model is saved to `outputs/llama/magnitude_pruning/semistructure/2_4`.

```bash
fusion_bench \
    --config-name llama_magnitude_pruning \
    method.prune_type=semistructured \
    method.n=2 method.m=4 \
    modelpool.models.0.path=decapoda-research/llama-7b-hf \
    merged_model_save_path=outputs/llama/magnitude_pruning/semistructure/2_4
```
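As an illustration of what 2:4 semi-structured pruning does, the toy sketch below (not the library's implementation) keeps the two largest-magnitude weights in every contiguous group of four:

```python
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero out the 2 smallest-magnitude weights in each group of 4 (per row)."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0
    groups = weight.abs().reshape(out_features, -1, 4)
    # Indices of the top-2 magnitudes within each group of 4.
    topk = groups.topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, topk, True)
    return weight * mask.reshape(out_features, in_features)

w = torch.randn(8, 16)
w_pruned = prune_2_4(w)  # 50% sparsity, 2 nonzeros per group of 4
```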
Below is an example of how to visualize the pruned weights of the first layer of the pruned model.

```python
from transformers import AutoModelForCausalLM
import matplotlib.pyplot as plt
import seaborn as sns
import torch

# Load the pruned model
model = AutoModelForCausalLM.from_pretrained("outputs/llama/magnitude_pruning/semistructure/2_4")

# Extract the tensor data
tensor_data = model.model.layers[0].self_attn.q_proj.weight[:32, :32]

# Convert to NumPy array
tensor_data_np = tensor_data.detach().cpu().numpy()

# Plot heatmap
plt.figure(figsize=(10, 8))
ax = sns.heatmap(tensor_data_np, center=0, cmap="coolwarm", annot=False)

# Add grid lines for 4x4 cells
for i in range(0, tensor_data_np.shape[0], 4):
    ax.axhline(i, color="black", linewidth=0.5)
    ax.axvline(i, color="black", linewidth=0.5)

plt.title("Heatmap of q_proj.weight[:32, :32]")
plt.show()
```
The following image shows the pruned weights of the first layer of the pruned model.
MagnitudePruningForLlama
+Bases: BaseAlgorithm, SimpleProfilerMixin
+Implements magnitude-based pruning for Llama models.
+This class supports both unstructured and semi-structured pruning methods. It loads a pre-trained model or the first model in the pool and applies the specified pruning technique.
+Methods:
+run(modelpool: LLamaForCausalLMPool) -> nn.Module – Executes the pruning process on the model pool and returns the pruned model.
+Source code in fusion_bench/method/pruning/llama_magnitude_prune.py
run(modelpool)
+Execute the pruning process on the first model from the given model pool.
+Parameters:
+modelpool (CausalLMPool) – The model pool containing the models to prune.
+Returns:
+nn.Module – The pruned model.
+Source code in fusion_bench/method/pruning/llama_magnitude_prune.py
Abstract
+Solving multi-objective optimization problems for large deep neural networks is challenging due to the complexity of the loss landscape and the expensive computational cost of training and evaluating models. Efficient Pareto front approximation of large models enables multi-objective optimization for tasks such as multi-task learning and trade-off analysis. Existing algorithms for learning the Pareto set, including (1) evolutionary, hypernetwork, and hypervolume-maximization methods, are computationally expensive and scale poorly to large models, while (2) scalarization algorithms, which train a separate model for each objective ray, are inefficient for learning the entire Pareto set and fail to capture the objective trade-offs effectively. Inspired by the recent success of model merging, we propose a practical and scalable approach to the Pareto set learning problem via mixture-of-experts (MoE) based model fusion. By ensembling the weights of specialized single-task models, the MoE module can effectively capture the trade-offs between multiple objectives and closely approximate the entire Pareto set of large neural networks. Once the routers are learned and a preference vector is set, the MoE module can be unloaded, so no additional computational cost is introduced during inference. We conduct extensive experiments on vision and language tasks using large-scale models such as CLIP-ViT and GPT-2. The experimental results demonstrate that our method efficiently approximates the entire Pareto front of large models. Using only hundreds of trainable parameters in the MoE routers, our method achieves lower memory usage than linear scalarization and algorithms that learn a single Pareto-optimal solution, and scales to both the number of objectives and the size of the model. Our method significantly reduces the computational burden of learning the Pareto set; for example, in the two-task case, the entire Pareto set can be learned in just a few minutes. Code is available at: GitHub.
+Not tested yet
+The examples provided below have not been tested yet.
+For a thoroughly tested and verified implementation of the algorithm, please refer to the original repository: tanganke/pareto_set_learning . +Additionally, the experimental results and further insights into the algorithm can be found in the original research paper: arXiv:2406.09770 .
+Run PWEMoE-LS on eight image classification tasks using CLIP-ViT-B/32 models; the results are logged to outputs/logs/ViT-B-32/PWEMoE-LS-8tasks.
fusion_bench \
+ method=pwe_moe_ls_for_clip \
+ modelpool=CLIPVisionModelPool/clip-vit-base-patch32_TA8 \
+ taskpool=CLIPVisionModelTaskPool/clip-vit-classification_TA8 \
+ fabric_logger.root_dir=outputs/logs/ViT-B-32 \
+ fabric_logger.name=PWEMoE-LS-8tasks
+
clip_pwe_moe
+PWEMoEAlgorithmForCLIP
+Bases: BaseAlgorithm, SimpleProfilerMixin, CLIPClassificationMixin
+Source code in fusion_bench/method/pwe_moe/clip_pwe_moe.py
compute_loss(model, ray, losses) (abstractmethod)
+Computes the overall losses using the given preference ray.
+Parameters:
+model (Module) – The model being trained.
+ray (Tensor) – A tensor representing the preference ray, which contains the weights for each objective.
+losses (List[Tensor]) – A list of loss values for each objective.
+Source code in fusion_bench/method/pwe_moe/clip_pwe_moe.py
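+For intuition, under linear scalarization the overall loss reduces to the preference-weighted sum of the per-objective losses, \(\sum_i r_i \ell_i\). A minimal sketch of such a reduction (illustrative only, not the library source):
+import torch
+
+def linear_scalarization_loss(ray: torch.Tensor, losses: list[torch.Tensor]) -> torch.Tensor:
+    # ray: preference vector of shape (n_objectives,), typically summing to 1
+    # losses: one scalar loss tensor per objective
+    return (ray * torch.stack(losses)).sum()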
load_clip_models()
+Loads the pretrained CLIP model and the fine-tuned models for each dataset specified in the configuration.
+Source code in fusion_bench/method/pwe_moe/clip_pwe_moe.py
setup_train_loaders()
+Loads the datasets specified in the configuration.
+Source code in fusion_bench/method/pwe_moe/clip_pwe_moe.py
PWEMoELinearScalarizationForCLIP
+Bases: PWEMoEAlgorithmForCLIP
+Source code in fusion_bench/method/pwe_moe/clip_pwe_moe.py
PWEMoExactParetoOptimalForCLIP
+Bases: PWEMoEAlgorithmForCLIP
+Source code in fusion_bench/method/pwe_moe/clip_pwe_moe.py
Merge CLIP-ViT-B/32 models on eight image classification tasks
+fusion_bench method=clip_regmean \
+ modelpool=clip-vit-base-patch32_TA8 \
+ taskpool=clip-vit-classification_TA8
+
Merge CLIP-ViT-L/14 models on eight image classification tasks
+fusion_bench \
+ method=clip_regmean \
+ method.batch_size=8 method.num_workers=4 \
+ modelpool=clip-vit-large-patch14_TA8 \
+ taskpool=clip-vit-classification_TA8 \
+ taskpool.clip_model=openai/clip-vit-large-patch14
+
Merge GPT-2 models for text classification tasks:
+Xisen Jin, et al. "Dataless Knowledge Fusion by Merging Weights of Language Models." http://arxiv.org/abs/2212.09849 ↩
+Simple averaging, known in the literature as isotropic merging or ModelSoups, aims to yield a more robust and generalizable model. It is a technique frequently employed when multiple models have been fine-tuned or trained independently from scratch. Specifically, if we possess \(n\) models that share a common architecture but have different weights \(\theta_i\), the weights of the merged model \(\theta\) are computed as follows:
+\[ \theta = \frac{1}{n} \sum_{i=1}^{n} \theta_i \]
+This equation simply states that each weight of the final model is the average of the corresponding weights in the individual models. For example, if we have three models and the weight of the first neuron in the first layer is 0.1, 0.2, and 0.3 in each model respectively, the weight of that neuron in the final model will be (0.1 + 0.2 + 0.3) / 3 = 0.2.
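+In code, this amounts to averaging state dicts entry-wise. A minimal sketch, assuming same-architecture nn.Module instances with floating-point parameters:
+import copy
+import torch.nn as nn
+
+def simple_average(models: list[nn.Module]) -> nn.Module:
+    # Average all parameters and buffers entry-wise across the models.
+    sds = [m.state_dict() for m in models]
+    avg = {k: sum(sd[k] for sd in sds) / len(sds) for k in sds[0]}
+    merged = copy.deepcopy(models[0])
+    merged.load_state_dict(avg)
+    return merged
+
+models = [nn.Linear(10, 10) for _ in range(3)]
+merged = simple_average(models)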
+Simple averaging is a straightforward and scalable method for model fusion. It does not require any additional training or fine-tuning, making it a good choice when computational resources are limited, where maintaining an ensemble of models is not feasible.
+This method often assumes that all models are equally good. If some models are significantly better than others, it might be beneficial to assign more weight to the better models when averaging; otherwise, a poor model may negatively impact the merged model. This can be done with weighted averaging, where each model's contribution to the final model is weighted by its performance on a validation set or some other metric. See Weighted Averaging for more details.
+In this example, we will demonstrate how to use the SimpleAverageAlgorithm
class from the fusion_bench.method
module.
+This algorithm is used to merge multiple models by averaging their parameters.
from fusion_bench.method.simple_average import SimpleAverageAlgorithm
+
+# Instantiate the SimpleAverageAlgorithm
+# This algorithm will be used to merge multiple models by averaging their parameters.
+algorithm = SimpleAverageAlgorithm()
+
+# Assume we have a list of PyTorch models (nn.Module instances) that we want to merge.
+# The models should all have the same architecture.
+models = [...]
+
+# Run the algorithm on the models.
+# This will return a new model that is the result of averaging the parameters of the input models.
+merged_model = algorithm.run(models)
+
The run
method of the SimpleAverageAlgorithm
class takes a list of models as input and returns a new model.
+The new model's parameters are the average of the parameters of the input models.
+This is useful in scenarios where you have trained multiple models and want to combine them into a single model that hopefully performs better than any individual model.
Configuration template for the Simple Averaging algorithm:
+Use the following command to run the Simple Averaging algorithm:
SimpleAverageAlgorithm
+Bases: BaseAlgorithm, SimpleProfilerMixin
+Source code in fusion_bench/method/simple_average.py
run(modelpool)
+Fuse the models in the given model pool using simple averaging.
+This method iterates over the names of the models in the model pool, loads each model, and appends it to a list. It then returns the simple average of the models in the list.
+Parameters:
+modelpool (Union[BaseModelPool, Dict[str, Module]]) – The pool of models to fuse.
+Returns:
+The fused model obtained by simple averaging.
+Source code in fusion_bench/method/simple_average.py
Ensemble methods are simple and effective ways to improve the performance of machine learning models. +They combine the outputs of multiple models to create a stronger model.
+from fusion_bench.method import EnsembleAlgorithm
+
+# Instantiate the EnsembleAlgorithm
+algorithm = EnsembleAlgorithm()
+
+# Assume we have a list of PyTorch models (nn.Module instances) that we want to ensemble.
+models = [...]
+
+# Run the algorithm on the models.
+merged_model = algorithm.run(models)
+
Configuration template for the ensemble algorithm:
+Create a simple ensemble of CLIP-ViT models for image classification tasks:
+fusion_bench \
+ method=ensemble/simple_ensemble \
+ modelpool=CLIPVisionModelPool/clip-vit-base-patch32_TA8 \
+ taskpool=CLIPVisionModelTaskPool/clip-vit-classification_TA8
+
SimpleEnsembleAlgorithm
+Bases: BaseAlgorithm
+Source code in fusion_bench/method/ensemble.py
run(modelpool)
+Run the simple ensemble algorithm on the given model pool.
+Parameters:
+modelpool (BaseModelPool | List[Module]) – The pool of models to ensemble.
+Returns:
+EnsembleModule – The ensembled model.
+Source code in fusion_bench/method/ensemble.py
Here we present the taxonomy for the SMILE upscaling method following "A Survey on Model MoErging" by Yadav et al. (2024) 2.
| | | | | | |
|---|---|---|---|---|---|
| Expert Training | Standard | Expert Data | Private | Routing Dataset | None |
| Input Granularity | Step | Depth Granularity | Module | Expert Selection | Sparse |
| Expert Aggregation | Output | Generalization | In-Distribution | User Dataset | Zero-Shot |
The SMILE upscaling method offers several configuration options, which are located in the config/method/ directory.
+nn.Module upscaling: this configuration is designed for upscaling any neural network module (nn.Module).
+Each configuration file contains detailed parameters and options that can be adjusted to meet the specific needs of your model and application.
+name: smile_upscaling
+
+# merge device on cuda can accelerate the SVD computation
+device: cpu
+# device to compute svd
+upscaling_accelerator: cuda
+full_matrices: true # set to false if you are sure k < rank
+
+gate_k: 1
+k: 128
+top_k: 1
+
+routing_use_diff: true
+# average the remaining parts; if set to false, the remaining parts are kept as in the base model (the pretrained model)
+average_experts: false
+
+# path to save/load the model
+model_path: null
+
name: smile_mistral_upscaling
+
+device: cpu
+accelerator: cuda
+
+# path to save/load the model
+model_path: null
+model_dtype: float16
+
+num_experts_per_tok: 1
+rank_of_router: 8
+rank_of_expert: 512
+
Evaluate single fine-tuned models and save the results to outputs/ViT-B-32/single-task/ and outputs/ViT-L-14/single-task/ for CLIP-ViT-B/32 and CLIP-ViT-L/14 models, respectively.
# evaluate single fine-tuned models
+for task in sun397 stanford-cars resisc45 eurosat svhn gtsrb mnist dtd
+do
+ fusion_bench method=dummy \
+ modelpool=clip-vit-base-patch32_individual \
+ modelpool.models.0.path=tanganke/clip-vit-base-patch32_${task} \
+ taskpool=clip-vit-classification_TA8 \
+ report_save_path="outputs/ViT-B-32/single-task/clip-vit-base-patch32_${task}.json"
+done
+
+# if you have multiple GPUs, you can run the following code to evaluate the CLIP-ViT-L/14 models in parallel
+# evaluate single fine-tuned CLIP-ViT-L/14 models
+tasks=(sun397 stanford-cars resisc45 eurosat svhn gtsrb mnist dtd)
+CUDA_DEVICES=(0 1 2 3 4 5 6 7) # List of CUDA devices to use
+
+for i in "${!CUDA_DEVICES[@]}"; do
+ task=${tasks[$i]}
+ CUDA_VISIBLE_DEVICES=${CUDA_DEVICES[$i]} fusion_bench method=dummy \
+ modelpool=clip-vit-large-patch14_individual \
+ modelpool.models.0.path=tanganke/clip-vit-large-patch14_${task} \
+ taskpool=clip-vit-classification_TA8 \
+ taskpool.clip_model=openai/clip-vit-large-patch14 \
+ report_save_path="outputs/ViT-L-14/single-task/clip-vit-large-patch14_${task}.json" &
+done
+
Upscale eight CLIP-ViT-B/32 models with SMILE, each CLIP-ViT-B/32 model is trained on a downstream task.
+gate_k=16
+k=32
+fusion_bench \
+ method=smile_upscaling \
+ method.device=cuda \
+ method.gate_k=$gate_k method.k=$k \
+ modelpool=CLIPVisionModelPool/clip-vit-base-patch32_TA8 \
+ taskpool=CLIPVisionModelTaskPool/clip-vit-classification_TA8 \
+ report_save_path="outputs/ViT-B-32/eight_tasks/gate_k\=${gate_k}_k\=${k}.json"
+
Hyperparameter search for SMILE upscaling. Pre-run results can be found in examples/smile_upscaling/clip-vit-base-patch32.ipynb.
for gate_k in 1 2 4 8 16 32 64 128 256 512 768; do
+ for k in 4 8 16 32 64 128 -1; do
+ fusion_bench \
+ method=smile_upscaling \
+ method.device=cuda \
+ method.gate_k=$gate_k method.k=$k \
+ modelpool=clip-vit-base-patch32_TA8 \
+ taskpool=clip-vit-classification_TA8 \
+ report_save_path="outputs/ViT-B-32/eight_tasks/gate_k\=${gate_k}_k\=${k}.json"
+ done
+done
+
Ablations on the number of experts per token (Top-K). Pre-run results can be found in examples/smile_upscaling/clip-vit-base-patch32-ablations-topk.ipynb.
gate_k=16
+k=32
+for top_k in 1 2 4
+do
+fusion_bench \
+ method=smile_upscaling \
+ method.device=cuda \
+ method.gate_k=$gate_k method.k=$k \
+ modelpool=clip-vit-base-patch32_TA8 \
+ taskpool=clip-vit-classification_TA8 \
+ report_save_path="outputs/ViT-B-32/ablation/gate_k\=${gate_k}_k\=${k}.json"
+done
+
Hyperparameter search for SMILE upscaling on CLIP-ViT-L/14 models. Pre-run results can be found in examples/smile_upscaling/clip-vit-large-patch14.ipynb.
for gate_k in 1 2 4 8 16 32 64 128; do
+ for k in 4 8 16 32 64 128 -1; do
+ fusion_bench \
+ method=smile_upscaling \
+ method.gate_k=$gate_k method.k=$k \
+ modelpool=clip-vit-large-patch14_TA8 \
+ taskpool=clip-vit-classification_TA8 \
+ taskpool.clip_model=openai/clip-vit-large-patch14 \
+      report_save_path="outputs/ViT-L-14/eight_tasks/gate_k\=${gate_k}_k\=${k}.json"
+ done
+done
+
Hyperparameter search for full fine-tuned and LoRA fine-tuned Flan-T5 models.
+Pre-run results can be found in examples/smile_upscaling/flan-t5-base.ipynb and examples/smile_upscaling/flan-t5-base-lora16.ipynb.
# hyperparameter search for full fine-tuned flan-t5-base
+for gate_k in 4 8 16 32; do
+ for k in 16 32 64 128; do
+ fusion_bench \
+ method=smile_upscaling \
+ method.device=cpu \
+ method.gate_k=$gate_k method.k=$k \
+ modelpool=flan-t5-base_glue \
+ taskpool=flan-t5_glue_text_generation \
+ report_save_path="outputs/flan-t5-base/glue_text_generation/gate_k\=${gate_k}_k\=${k}.json"
+ done
+done
+
+# hyperparameter search for lora fine-tuned flan-t5-base
+for gate_k in 2 4 8; do
+ for k in 4 8 16; do
+ fusion_bench \
+ method=smile_upscaling \
+ method.device=cuda \
+ method.gate_k=$gate_k method.k=$k \
+ modelpool=flan-t5-base_glue_lora16 \
+ taskpool=flan-t5_glue_text_generation \
+ report_save_path="outputs/flan-t5-base_lora16/glue_text_generation/gate_k\=${gate_k}_k\=${k}.json"
+ done
+done
+
Here we upscale several Mistral-7B models using SMILE. The models are trained on different tasks and are used as experts in the SMILE upscaling.
+We first provide an example of the upscaled model, where we upscale the linear layers of the original Mistral model into a SMILE linear layer.
+import torch
+from accelerate import init_empty_weights
+from transformers import AutoConfig
+
+from fusion_bench.models.modeling_smile_mistral import (
+ SmileMistralConfig,
+ SmileMistralForCausalLM,
+)
+
+
+config = AutoConfig.from_pretrained(
+ "mistralai/Mistral-7B-v0.1"
+)
+config = SmileMistralConfig(
+ num_experts_per_tok=1,
+ rank_of_router=8,
+ rank_of_expert=8,
+ num_local_experts=3,
+ **config.to_dict()
+)
+with init_empty_weights():
+ model = SmileMistralForCausalLM(config)
+model.to(dtype=torch.float16).to_empty(device="cuda")
+
The model architecture is as follows:
+SmileMistralForCausalLM(
+ (model): SmileMistralModel(
+ (embed_tokens): Embedding(32000, 4096)
+ (layers): ModuleList(
+ (0-31): 32 x SmileMistralDecoderLayer(
+ (self_attn): SmileMistralAttention(
+ (q_proj): SingularMoELinear(in_features=4096, out_features=4096, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (k_proj): SingularMoELinear(in_features=4096, out_features=1024, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (v_proj): SingularMoELinear(in_features=4096, out_features=1024, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (o_proj): SingularMoELinear(in_features=4096, out_features=4096, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (rotary_emb): MistralRotaryEmbedding()
+ )
+ (mlp): SmileMistralMLP(
+ (gate_proj): SingularMoELinear(in_features=4096, out_features=14336, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (up_proj): SingularMoELinear(in_features=4096, out_features=14336, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (down_proj): SingularMoELinear(in_features=14336, out_features=4096, num_local_experts=3, num_experts_per_tok=1, rank_of_router=8, rank_of_expert=8)
+ (act_fn): SiLU()
+ )
+ (input_layernorm): MistralRMSNorm()
+ (post_attention_layernorm): MistralRMSNorm()
+ )
+ )
+ (norm): MistralRMSNorm()
+ )
+ (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
+)
+
Knowing the model architecture, we can upscale the Mistral-7B models using the following steps:
+Prepare the following 4 configuration files in configs/modelpool
:
type: AutoModelForCausalLMPool
+models:
+- name: _pretrained_
+ path: mistralai/Mistral-7B-v0.1
+- name: expert_1
+ path: meta-math/MetaMath-Mistral-7B
+
+dtype: float16
+
type: AutoModelForCausalLMPool
+models:
+- name: _pretrained_
+ path: mistralai/Mistral-7B-v0.1
+- name: expert_1
+ path: cognitivecomputations/dolphin-2.1-mistral-7b
+
+dtype: float16
+
type: AutoModelForCausalLMPool
+models:
+- name: _pretrained_
+ path: mistralai/Mistral-7B-v0.1
+- name: expert_1
+ path: uukuguy/speechless-code-mistral-7b-v1.0
+
+dtype: float16
+
type: AutoModelForCausalLMPool
+models:
+- name: _pretrained_
+ path: mistralai/Mistral-7B-v0.1
+- name: expert_1
+ path: meta-math/MetaMath-Mistral-7B
+- name: expert_2
+ path: cognitivecomputations/dolphin-2.1-mistral-7b
+- name: expert_3
+ path: uukuguy/speechless-code-mistral-7b-v1.0
+
+dtype: float16
+
Upscale Mistral-7B models. The upscaled models are saved in outputs/mistral/gate_k-${gate_k}_k-${k}/version_${version}.
function model_fusion() {
+ output_dir=outputs/mistral/gate_k-${gate_k}_k-${k}/version_${version}
+ fusion_bench \
+ method=smile_mistral_upscaling \
+ method.rank_of_router=$gate_k method.rank_of_expert=$k \
+ method.model_path=${output_dir} \
+ modelpool=smile_mistral_exp_v${version} \
+ modelpool.dtype=float32 \
+ taskpool=dummy \
+ report_save_path="${output_dir}/model_info.json"
+}
+
+gate_k=8
+for k in 8 16 32 64 128 256 384 512; do
+ for version in 1 2 3 4; do
+ model_fusion
+ done
+done
+
Use lm-evaluation-harness to evaluate the models. We use the default configurations for each task.
+# For some GPUs, the following environment variables need to be set
+# export NCCL_P2P_DISABLE="1"
+# export NCCL_IB_DISABLE="1"
+
+function model_eval() {
+ output_dir=outputs/mistral/gate_k-${gate_k}_k-${k}/version_${version}
+
+ # Check if ${output_dir}/${task}.json exists as a directory and return if it does
+ if [ -d "${output_dir}/${task}.json" ]; then
+ echo "Directory ${output_dir}/${task}.json already exists. Skipping evaluation."
+ return
+ fi
+
+ lm_eval --model hf \
+ --model_args pretrained=${output_dir},dtype="float16",parallelize=True \
+ --tasks ${task} \
+ --output_path ${output_dir}/${task}.json \
+ --batch_size 6
+}
+
The above function can be used to evaluate the models on a specified task.
+Pre-run results can be found in examples/smile_upscaling/mistral_gsm8k.ipynb.
# Evaluate all the models on GSM8K task
+gate_k=8
+task=gsm8k
+for k in 8 16 32 64 128 256 384 512; do
+ for version in 1 2 3 4; do
+ model_eval
+ done
+done
+
+# Evaluate all M0;123 models on truthfulqa gsm8k arc_challenge mmlu
+k=8
+version=4
+for task in truthfulqa gsm8k arc_challenge mmlu; do
+ model_eval
+done
+
The reported metrics are:
Pre-run results can be found in examples/smile_upscaling/clip-vit-base-patch32_single-task_projection-merging.ipynb.
# project into different subspaces
+for task in sun397 stanford-cars resisc45 eurosat svhn gtsrb mnist dtd
+do
+ # Space I
+ CUDA_VISIBLE_DEVICES=0 fusion_bench \
+ method=singular_projection_merging \
+ method.device=cuda method.rank=low method.k=-1 method.full_matrices=false \
+ modelpool=clip-vit-base-patch32_single_finetuned \
+ modelpool.models.1.name=${task} \
+ modelpool.models.1.path=tanganke/clip-vit-base-patch32_${task} \
+ taskpool=clip-vit-classification_TA8 \
+ report_save_path="outputs/ViT-B-32/single-task/projection_merging_zone1_${task}.json" &
+
+ # Space II
+ CUDA_VISIBLE_DEVICES=1 fusion_bench \
+ method=singular_projection_merging \
+ method.device=cuda method.rank=high method.k=-1 method.full_matrices=false \
+ modelpool=clip-vit-base-patch32_single_finetuned \
+ modelpool.models.1.name=${task} \
+ modelpool.models.1.path=tanganke/clip-vit-base-patch32_${task} \
+ taskpool=clip-vit-classification_TA8 \
+ report_save_path="outputs/ViT-B-32/single-task/projection_merging_zone2_${task}.json" &
+
+ # Space III
+ CUDA_VISIBLE_DEVICES=2 fusion_bench \
+ method=singular_projection_merging \
+ method.device=cuda method.rank=high method.k=-1 method.full_matrices=true \
+ modelpool=clip-vit-base-patch32_single_finetuned \
+ modelpool.models.1.name=${task} \
+ modelpool.models.1.path=tanganke/clip-vit-base-patch32_${task} \
+ taskpool=clip-vit-classification_TA8 \
+ report_save_path="outputs/ViT-B-32/single-task/projection_merging_zone23_${task}.json" &
+ wait
+done
+
SmileUpscalingAlgorithm
+Bases: SimpleProfilerMixin, BaseAlgorithm
+Source code in fusion_bench/method/smile_upscaling/smile_upscaling.py
__init__(*, device='cuda', upscaling_accelerator=None, full_matrices=True, gate_k=256, k=256, top_k=1, routing_use_diff=True, average_experts=False, model_path=None, **kwargs)
+Initialize the SmileUpscalingAlgorithm.
+Parameters:
+device (str, default: 'cuda') – The device to perform the computation on.
+upscaling_accelerator (str, default: None) – The device to perform the SVD computation on.
+full_matrices (bool, default: True) – Whether to compute the full-sized U and V matrices.
+gate_k (int, default: 256) – The number of singular values to keep for the gate.
+k (int, default: 256) – The number of singular values to keep for the experts.
+top_k (int, default: 1) – The number of top experts to select.
+routing_use_diff (bool, default: True) – Whether to use weight differences for routing.
+average_experts (bool, default: False) – Whether to average the experts.
+model_path (str, default: None) – The path to save/load the model.
+**kwargs – Additional arguments.
+Source code in fusion_bench/method/smile_upscaling/smile_upscaling.py
merge(pretrained_model, finetuned_models, in_place=True)
+Merges the pretrained model with the fine-tuned models to create an upscaled model.
+Parameters:
+pretrained_model (Module) – The pretrained model.
+finetuned_models (List[Module]) – A list of fine-tuned models.
+in_place (bool, default: True) – If True, modifies the pretrained model in place. Otherwise, creates a copy.
+Returns:
+nn.Module – The merged model.
+Source code in fusion_bench/method/smile_upscaling/smile_upscaling.py
run(modelpool)
+Executes the upscaling process.
+Parameters:
+modelpool (ModelPool) – The pool of models to be used for upscaling.
+Returns:
+nn.Module – The upscaled model.
+Source code in fusion_bench/method/smile_upscaling/smile_upscaling.py
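+Putting the API together, here is a minimal Python sketch (the import path is inferred from the source file location; `pretrained_model` and `finetuned_models` are placeholders for same-architecture nn.Module instances):
+from fusion_bench.method.smile_upscaling.smile_upscaling import SmileUpscalingAlgorithm
+
+# Keep 16 singular values for the router and 32 for each expert.
+algorithm = SmileUpscalingAlgorithm(gate_k=16, k=32, top_k=1)
+
+# Merge the pre-trained model with the fine-tuned experts into an upscaled model.
+upscaled = algorithm.merge(pretrained_model, finetuned_models, in_place=False)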
A. Tang et al. SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models. Aug, 2024. +https://arxiv.org/abs/2408.10174 ↩
+Yadav, Prateek, et al. "A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning." arXiv preprint arXiv:2408.07057 (2024). ↩
+In the rapidly advancing field of machine learning, multi-task learning has emerged as a powerful paradigm, allowing models to leverage information from multiple tasks to improve performance and generalization. One intriguing method in this domain is Task Arithmetic, which involves the combination of task-specific vectors derived from model parameters.
+Task Vector. A task vector encapsulates the adjustments needed by a model to specialize in a specific task. It is derived from the difference between a pre-trained model's parameters and those fine-tuned for a particular task. Formally, if \(\theta_i\) represents the model parameters fine-tuned for the \(i\)-th task and \(\theta_0\) denotes the parameters of the pre-trained model, the task vector for the \(i\)-th task is defined as:
+\[ \tau_i = \theta_i - \theta_0 \]
+This representation is crucial for methods like Task Arithmetic, where multiple task vectors are aggregated and scaled to form a comprehensive multi-task model.
+Task Arithmetic1 begins by computing a task vector \(\tau_i\) for each individual task, using the set of model parameters \(\theta_0 \cup \{\theta_i\}_i\), where \(\theta_0\) is the pre-trained model and \(\theta_i\) are the fine-tuned parameters for the \(i\)-th task. These task vectors are then aggregated to form a multi-task vector. Subsequently, the multi-task vector is combined with the pre-trained model parameters to obtain the final multi-task model. This process involves scaling the combined vector element-wise by a scaling coefficient (denoted as \(\lambda\)) before adding it to the initial pre-trained model parameters. The resulting formulation for obtaining a multi-task model is expressed as
+\[ \theta = \theta_0 + \lambda \sum_{i} \tau_i \]
+The choice of the scaling coefficient \(\lambda\) plays a crucial role in the final model performance. Typically, \(\lambda\) is chosen based on validation set performance.
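+The update above is easy to express directly on state dicts. A minimal sketch, independent of the library classes below (assumes same-architecture models with floating-point parameters):
+import copy
+import torch.nn as nn
+
+def task_arithmetic_merge(pretrained: nn.Module, finetuned: list[nn.Module], lam: float = 0.3) -> nn.Module:
+    # theta = theta_0 + lam * sum_i (theta_i - theta_0)
+    theta0 = pretrained.state_dict()
+    merged_sd = {
+        k: theta0[k] + lam * sum(ft.state_dict()[k] - theta0[k] for ft in finetuned)
+        for k in theta0
+    }
+    merged = copy.deepcopy(pretrained)
+    merged.load_state_dict(merged_sd)
+    return merged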
+To use the Task Arithmetic algorithm, you can use the TaskArithmeticAlgorithm
class from the fusion_bench.method
module.
from fusion_bench.method.task_arithmetic import TaskArithmeticAlgorithm
+import torch.nn as nn
+from omegaconf import DictConfig
+
+# Instantiate the TaskArithmeticAlgorithm
+method_config = {'name': 'task_arithmetic', 'scaling_factor': 0.5}
+algorithm = TaskArithmeticAlgorithm(DictConfig(method_config))
+
+# Assume we have a dict of PyTorch models (nn.Module instances) that we want to merge.
+# The models should all have the same architecture.
+# the dict must contain the pre-trained model with the key '_pretrained_', and arbitrary number of fine-tuned models.
+models = {'_pretrained_': nn.Linear(10,10), 'model_1': nn.Linear(10,10), 'model_2': nn.Linear(10,10)}
+
+# Run the algorithm on the models.
+# This will return a new model that is the result of task arithmetic on the input models.
+merged_model = algorithm.run(models)
+
Configuration template for the Task Arithmetic algorithm:
+name: task_arithmetic
+scaling_factor: 0.5 # Scaling factor for task vectors
+
Use the following command to run the Task Arithmetic algorithm:
+For example, to run the Task Arithmetic algorithm on two models with a scaling factor of 0.5:
+fusion_bench method=task_arithmetic \
+ method.scaling_factor=0.5 \
+ modelpool=clip-vit-base-patch32_svhn_and_mnist \
+ taskpool=clip-vit-base-patch32_svhn_and_mnist
+
where the configuration for the model pool is:
+type: huggingface_clip_vision
+# the modelpool must contain the pre-trained model with the name '_pretrained_',
+# and arbitrary number of fine-tuned models.
+models:
+ - name: _pretrained_
+ path: openai/clip-vit-base-patch32
+ - name: svhn
+ path: tanganke/clip-vit-base-patch32_svhn
+ - name: mnist
+ path: tanganke/clip-vit-base-patch32_mnist
+
and the configuration for the task pool:
+type: clip_vit_classification
+
+dataset_type: huggingface_image_classification
+tasks:
+ - name: svhn
+ dataset:
+ type: instantiate
+ name: svhn
+ object:
+ _target_: datasets.load_dataset
+ _args_:
+ - svhn
+ - cropped_digits
+ split: test
+ - name: mnist
+ dataset:
+ name: mnist
+ split: test
+
+...
+
TaskArithmeticAlgorithm
+Bases: BaseAlgorithm, SimpleProfilerMixin
Task Arithmetic Algorithm for model fusion.
+This class implements the Task Arithmetic method for fusing models. It inherits from +BaseModelFusionAlgorithm and SimpleProfilerMixin to provide the necessary functionality +for model fusion and profiling.
+Attributes:
+scaling_factor (int) – The factor by which the task vectors will be scaled before merging.
+Source code in fusion_bench/method/task_arithmetic/task_arithmetic.py
_config_mapping = BaseAlgorithm._config_mapping | {'scaling_factor': 'scaling_factor'} (class-attribute, instance-attribute)
+scaling_factor = scaling_factor (instance-attribute)
__init__(scaling_factor)
+Initializes the TaskArithmeticAlgorithm with the given scaling factor.
+Parameters:
+scaling_factor (int) – The factor by which the task vectors will be scaled before merging.
+Source code in fusion_bench/method/task_arithmetic/task_arithmetic.py
run(modelpool)
+Runs the Task Arithmetic Algorithm to fuse models in the given model pool.
+Parameters:
+modelpool (Union[BaseModelPool, Dict[str, Module]]) – The pool of models to fuse.
+Returns:
+nn.Module – The pre-trained model with the merged task vectors.
+Source code in fusion_bench/method/task_arithmetic/task_arithmetic.py
(ICLR 2023) Editing Models with Task Arithmetic. http://arxiv.org/abs/2212.04089 ↩
+(ICLR 2024) AdaMerging: Adaptive Model Merging for Multi-Task Learning. http://arxiv.org/abs/2310.02575 ↩
+(NIPS 2023 Oral) Guillermo Ortiz-Jimenez, Alessandro Favero, and Pascal Frossard, “Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models,” doi: 10.48550/arXiv.2305.12827. ↩
+Ties-Merging1 represents a novel and structured approach to consolidating multiple task-specific models into a single, efficient multi-task model. This method employs a sequence of deliberate steps to systematically merge task vectors, ensuring that the final model effectively integrates the strengths of each individual task-specific model and resolves potential conflicts between them.
+The Ties-Merging algorithm operates through three primary steps:
+1. Trim: each task vector is sparsified by keeping only its largest-magnitude entries and resetting the rest to zero.
+2. Elect Sign: for every parameter, an aggregate sign is elected across the task vectors to resolve sign conflicts.
+3. Disjoint Merge: only the entries that agree with the elected sign are merged to form the final task vector.
+Given the final merged task vector \(\tau\), the ultimate model is determined similarly to the method used in task arithmetic. The formulation is expressed as:
+\[ \theta = \theta_0 + \lambda \tau \]
+where \(\lambda\) is a hyperparameter chosen based on the validation set to ensure the best-performing model.
+By following these structured steps, Ties-Merging effectively integrates multiple task-specific models into a unified multi-task model, balancing the contributions of each task to enhance overall performance. The process ensures that the final model retains the benefits of the pre-trained model while optimally incorporating the diverse knowledge contained within the individual task-specific models.
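+As a concrete illustration of the three steps on flattened task vectors, here is a minimal sketch (an approximation only; the library additionally exposes a `threshold` and a `merge_func` of 'sum', 'mean', or 'max', while this sketch keeps a fixed fraction of entries and uses the mean):
+import torch
+
+def ties_merge(task_vectors: list[torch.Tensor], keep: float = 0.2) -> torch.Tensor:
+    # Returns the merged task vector tau; the model is theta_0 + lam * tau.
+    tvs = torch.stack(task_vectors)                       # (n_tasks, n_params)
+    n = tvs.shape[1]
+    # 1) Trim: keep only the top `keep` fraction of entries by magnitude per task.
+    kth = tvs.abs().kthvalue(n - int(keep * n) + 1, dim=1, keepdim=True).values
+    tvs = torch.where(tvs.abs() >= kth, tvs, torch.zeros_like(tvs))
+    # 2) Elect sign: the sign of the summed (trimmed) task vectors.
+    elected = torch.sign(tvs.sum(dim=0))
+    # 3) Disjoint merge: average only the entries agreeing with the elected sign.
+    agree = torch.sign(tvs) == elected
+    return (tvs * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)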
+In the above figure, we show the average performance of Task Arithmetic and Ties-Merging merged models as the scaling coefficient varies. Subfigure (a), (b), (c), and (d) show the results of CLIP-ViT-B/32, CLIP-ViT-L/14, Flan-T5-base (LoRA fine-tuned), and Flan-T5-large (LoRA fine-tuned), respectively. It is evident that the merged multi-task model hits a peak in average performance across various tasks when the scaling coefficient is set around 0.3. This value was empirically selected as the scaling coefficient in our experiments. As we increase the scaling coefficient beyond this point, the average performance of the model begins to decline, eventually even falling below the level of the pre-trained model’s original performance. This suggests that too high of a scaling coefficient can have a negative impact on the knowledge that the pre-trained model initially possessed, emphasizing the importance of calibrating the scaling coefficient parameter \(\lambda\) to avoid diminishing the model’s existing strengths.
+Configuration template for the Ties-Merging algorithm:
+name: ties_merging
+# Scaling factor $\lambda$
+scaling_factor: 0.5
+threshold: 0.5
+# List of keys to remove from the state dict, default is empty
+remove_keys: []
+# Function to merge the models, default is sum. Options are 'sum', 'mean', and 'max'
+merge_func: sum
+
Use the following command to run the Ties-Merging algorithm:
TiesMergingAlgorithm
+Bases: BaseAlgorithm
TiesMergingAlgorithm is a class for fusing multiple models using the TIES merging technique.
+Attributes:
+scaling_factor (float) – The scaling factor to apply to the merged task vector.
+threshold (float) – The threshold for resetting values in the task vector.
+remove_keys (List[str]) – List of keys to remove from the state dictionary.
+merge_func (Literal['sum', 'mean', 'max']) – The merge function to use for disjoint merging.
+Source code in fusion_bench/method/ties_merging/ties_merging.py
_config_mapping = BaseAlgorithm._config_mapping | {'scaling_factor': 'scaling_factor', 'threshold': 'threshold', 'remove_keys': 'remove_keys', 'merge_func': 'merge_func'} (class-attribute, instance-attribute)
+merge_func = merge_func (instance-attribute)
+remove_keys = remove_keys (instance-attribute)
+scaling_factor = scaling_factor (instance-attribute)
+threshold = threshold (instance-attribute)
__init__(scaling_factor, threshold, remove_keys, merge_func, **kwargs)
+Initialize the TiesMergingAlgorithm with the given parameters.
+Parameters:
+scaling_factor (float) – The scaling factor to apply to the merged task vector.
+threshold (float) – The threshold for resetting values in the task vector.
+remove_keys (List[str]) – List of keys to remove from the state dictionary.
+merge_func (Literal['sum', 'mean', 'max']) – The merge function to use for disjoint merging.
+**kwargs – Additional keyword arguments for the base class.
+Source code in fusion_bench/method/ties_merging/ties_merging.py
run(modelpool)
+Run the TIES merging algorithm to fuse models in the model pool.
+Parameters:
+modelpool (BaseModelPool | Dict[str, Module]) – The model pool containing the models to fuse.
+Returns:
+nn.Module – The fused model.
+Source code in fusion_bench/method/ties_merging/ties_merging.py
(NIPS 2023) Resolving Interference When Merging Models. http://arxiv.org/abs/2306.01708 ↩
+This method is designed to handle a wide range of tasks by segregating shared information and task-specific knowledge. +It dynamically combines these elements based on the input samples.
+The Weight-Ensembling MoE module consists of three main components: the router, the pre-trained MLP weights, and a collection of task vectors. +The router, which is an MLP, processes the input data and generates routing weights. These weights determine how the knowledge from different tasks is combined. +The pre-trained MLP weights are crucial as they have been trained to recognize a wide range of data patterns. +The task vectors represent the differences between the MLPs that have been fine-tuned for specific tasks and the pre-trained ones, capturing the unique adjustments made to optimize them for specific tasks. +The routing weights are averaged across the input tokens, and these weights are used to select task vectors from a dictionary matrix. +These task vectors are then added to the pre-trained MLP weights to create input-conditioned weights.
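+The following minimal sketch (hypothetical names, illustration only) shows the idea for a single linear layer: an MLP router produces routing weights, these are averaged over the input tokens, and the routed combination of task vectors is added to the frozen pre-trained weight to form input-conditioned weights:
+import torch
+import torch.nn as nn
+
+class WeightEnsemblingLinearSketch(nn.Module):
+    # Illustrative only: theta(x) = theta_pre + sum_i r_i(x) * tau_i
+    def __init__(self, pretrained: nn.Linear, task_vectors: list[torch.Tensor], hidden: int = 16):
+        super().__init__()
+        self.weight = nn.Parameter(pretrained.weight.detach().clone(), requires_grad=False)
+        self.register_buffer("task_vectors", torch.stack(task_vectors))  # (n_tasks, out, in)
+        self.router = nn.Sequential(  # a small MLP router
+            nn.Linear(pretrained.in_features, hidden),
+            nn.ReLU(),
+            nn.Linear(hidden, len(task_vectors)),
+        )
+    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, in)
+        # Routing weights, averaged across the input tokens.
+        gate = self.router(x).mean(dim=1).softmax(dim=-1)  # (batch, n_tasks)
+        # Input-conditioned weights: add routed task vectors to the frozen weight.
+        w = self.weight + torch.einsum("bn,noi->boi", gate, self.task_vectors)
+        return torch.einsum("bti,boi->bto", x, w)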
+Algorithm Requirements:
| Method | Access to labeled tasks data | Access to validation data (labeled) | Test-time adaptation |
|---|---|---|---|
| Fisher Merging | Yes (estimate Fisher information matrix) | No | No |
| RegMean | Yes (compute Gram matrix) | No | No |
| Task Arithmetic | No | Yes (select scaling factor) | No |
| Ties-Merging | No | Yes (select scaling factor) | No |
| AdaMerging | No | No | Yes |
| Ours | No | No | Yes |
L. Shen, A. Tang, E. Yang et al. Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging. Oct, 2024.3
+Tip for reducing the parameter count
+Here we present the parameter count for the method outlined in the original paper1. An effective strategy to minimize the number of parameters involves employing Singular Value Decomposition (SVD) to compress the task vectors. This approach significantly cuts down the number of parameters while only marginally impacting performance. For additional information, please refer to the Twin-Merging paper2, which not only reduces the number of parameters but also conducts extensive experiments demonstrating the effectiveness of data-adaptive merging in the language domain.
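+For instance, a 2-D task vector can be compressed with a truncated SVD; a minimal sketch:
+import torch
+
+def svd_compress(task_vector: torch.Tensor, rank: int) -> torch.Tensor:
+    # Best rank-`rank` approximation of a 2-D task vector.
+    U, S, Vh = torch.linalg.svd(task_vector, full_matrices=False)
+    return U[:, :rank] * S[:rank] @ Vh[:rank, :]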
+Here is the number of parameters compared to a single pre-trained model (OpenCLIP CLIP-ViT-B/32):
| Method | Trainable Parameters | Total Parameters | Parameters Reduced by Merging |
|---|---|---|---|
| Single Pre-trained | 113.45M (100%) | 113.45M | - |
| WEMoE (2-layer, 1 task) | 7.10M (4.00%) | 177.21M | - |
| WEMoE (2-layer, 2 tasks) | 7.11M (3.04%) | 233.89M | 2*113.45-233.89=-6.99M |
| WEMoE (2-layer, 3 tasks) | 7.11M (2.45%) | 290.57M | 3*113.45-290.57=49.78M |
| WEMoE (2-layer, 4 tasks) | 7.12M (2.02%) | 347.25M | 4*113.45-347.25=106.55M |
| WEMoE (2-layer, 5 tasks) | 7.13M (1.77%) | 403.93M | 5*113.45-403.93=163.32M |
| WEMoE (2-layer, 6 tasks) | 7.14M (1.55%) | 460.61M | 6*113.45-460.61=220.09M |
| WEMoE (2-layer, 7 tasks) | 7.15M (1.38%) | 517.28M | 7*113.45-517.28=276.87M |
| WEMoE (2-layer, 8 tasks) | 7.16M (1.25%) | 573.96M | 8*113.45-573.96=333.64M |
The parameter counts of HuggingFace CLIP vision models (of type transformers.models.clip.modeling_clip.CLIPVisionModel) differ from those of the OpenCLIP models downloaded from the task arithmetic repo, because the OpenCLIP models (of type src.modeling.ImageEncoder) include the embedding layer for text tokens, while the HuggingFace CLIP vision models do not.
+Therefore, the relative parameter count of the upscaled model using Transformers CLIP vision models will be larger than for the OpenCLIP models.
ImageEncoder( # (1)
+ (model): CLIP(
+ (visual): VisualTransformer( # (2)
+ (conv1): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
+ (ln_pre): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ (transformer): Transformer(
+ (resblocks): ModuleList(
+ (0-11): 12 x ResidualAttentionBlock(
+ (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ (attn): MultiheadAttention(
+ (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
+ )
+ (ln_attn): Identity()
+ (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ (mlp): Sequential(
+ (c_fc): Linear(in_features=768, out_features=3072, bias=True)
+ (ln): Identity()
+ (gelu): QuickGELU()
+ (c_proj): Linear(in_features=3072, out_features=768, bias=True)
+ )
+ )
+ )
+ )
+ (ln_post): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ )
+ (token_embedding): Embedding(49408, 512) # (3)
+ (ln_final): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
+ )
+)
+
CLIPVisionModel( # (1)
+ (vision_model): CLIPVisionTransformer(
+ (embeddings): CLIPVisionEmbeddings(
+ (patch_embedding): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
+ (position_embedding): Embedding(50, 768)
+ )
+ (pre_layrnorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ (encoder): CLIPEncoder(
+ (layers): ModuleList(
+ (0-11): 12 x CLIPEncoderLayer(
+ (self_attn): CLIPAttention(
+ (k_proj): Linear(in_features=768, out_features=768, bias=True)
+ (v_proj): Linear(in_features=768, out_features=768, bias=True)
+ (q_proj): Linear(in_features=768, out_features=768, bias=True)
+ (out_proj): Linear(in_features=768, out_features=768, bias=True)
+ )
+ (layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ (mlp): CLIPMLP(
+ (activation_fn): QuickGELUActivation()
+ (fc1): Linear(in_features=768, out_features=3072, bias=True)
+ (fc2): Linear(in_features=3072, out_features=768, bias=True)
+ )
+ (layer_norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ )
+ )
+ )
+ (post_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+ )
+)
+
In the figure below, we show the performance of the merged models with varying numbers of training steps. Figure (a) shows results for merging CLIP-ViT-B/32 models with different learning rate configurations, and Figure (b) shows the performance of the merged WEMoE models as the number of steps varies. We observe that the performance of the merged model trends upward as the number of training steps increases and converges rapidly, reaching a high accuracy level in just 200 steps. Furthermore, the influence of different learning rates is not significant, suggesting that our method is insensitive to the learning rate parameter. This is a desirable property, as it reduces the need for hyperparameter tuning.
+Table: Parameter comparison of WEMoE (1-layer) and WEMoE (2-layer) on CLIP-ViT-B/32 models (OpenCLIP).
| Method | Number of Trainable Parameters |
|---|---|
| AdaMerging (layer-wise) | 1.3K |
| WEMoE (1-layer) | 73.8K (0.01%) |
| WEMoE (2-layer) | 7.16M (1.25%) |
Table: Ablation study of the router depth on the performance of the up-scaled CLIP-ViT-B/32 models (OpenCLIP).
| Method | SUN397 | CARS | RESISC45 | EuroSAT | SVHN | GTSRB | MNIST | DTD | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| AdaMerging (layer-wise) | 66.6 | 68.3 | 82.4 | 92.5 | 86.5 | 93.7 | 97.7 | 61.1 | 80.9 |
| WEMoE (1-layer) | 73.2 | 76.7 | 93.8 | 98.6 | 95.7 | 98.6 | 99.5 | 74.5 | 88.3 |
| WEMoE (2-layer) | 74.1 | 77.4 | 93.7 | 99.1 | 96.2 | 98.9 | 99.6 | 76.4 | 89.4 |
To explore the influence of router depth on the performance of the scaled-up model, we perform an ablation study where the router depth is varied. In WEMoE modules, the router is implemented as a multi-layer perceptron (MLP).
+In the above two Tables, we present additional findings to support our argument. We compare the number of trainable parameters and performance between WEMoE (1-layer) and WEMoE (2-layer). The data reveal that WEMoE (1-layer) possesses 73.8K trainable parameters, which constitute only 0.01% of the total parameters in the merged model. Notably, the performance of WEMoE (1-layer) is significantly better than AdaMerging and nearly matches that of WEMoE (2-layer) across all tasks. This evidence underscores our claim that the MoE design is crucial for performance enhancement.
+Multi-task model fusion experiment on eight image classification tasks:
+# merge eight CLIP-ViT-B/32 models using WE MoE
+fusion_bench \
+ method=weight_ensembling_moe \
+ method.name=clip_weight_ensembling_moe \
+ method.use_grad_accumulate=false \
+ method.save_checkpoint=outputs/clip-vit-base-patch32_TA8_weight_ensembling_moe_checkpoint.ckpt \
+ modelpool=clip-vit-base-patch32_TA8 \
+ taskpool=clip-vit-classification_TA8
+
Merge eight CLIP-ViT-L/14 models:
+# merge eight CLIP-ViT-L/14 models using WE MoE, fine-tune the routers
+fusion_bench print_config=false \
+ method=weight_ensembling_moe \
+ method.name=clip_weight_ensembling_moe \
+ method.use_grad_accumulate=true \
+ method.save_checkpoint=outputs/clip-vit-large-patch14_TA8_weight_ensembling_moe_checkpoint.ckpt \
+ method.batch_size=4 method.devices=4 \
+ modelpool=clip-vit-large-patch14_TA8 \
+ taskpool=dummy &&
+
+# load the checkpoint and evaluate the model
+fusion_bench \
+ method=weight_ensembling_moe \
+ method.name=clip_weight_ensembling_moe \
+ method.checkpoint=outputs/clip-vit-large-patch14_TA8_weight_ensembling_moe_checkpoint.ckpt \
+ modelpool=clip-vit-large-patch14_TA8 \
+ taskpool=clip-vit-classification_TA8 \
+ taskpool.clip_model=openai/clip-vit-large-patch14
+
we_moe
+WeightEnsemblingMoEAlgorithm
+Bases: ModelFusionAlgorithm
Algorithm for fusing models using Weight Ensembling Mixture of Experts (MoE).
+This class provides methods for constructing the MoE model, performing test-time adaptation, +and running the fusion process.
+Attributes:
+_fabric (Fabric) – The fabric for distributed training.
+modelpool (ModelPool) – The pool of models to be fused.
+profiler (SimpleProfiler) – The profiler for measuring performance.
+Source code in fusion_bench/method/we_moe/we_moe.py
__init__(algorithm_config)
+Initialize the WeightEnsemblingMoEAlgorithm with the given configuration.
+Parameters:
+algorithm_config (DictConfig) – The configuration for the algorithm.
+Source code in fusion_bench/method/we_moe/we_moe.py
compute_logits(module, batch, task) (abstractmethod)
+Compute the logits for a given batch and task.
+Parameters:
+module – The model module to use for computing logits.
+batch – The batch of data.
+task – The task for which to compute logits.
+Returns:
+Tensor – The computed logits.
+Source code in fusion_bench/method/we_moe/we_moe.py
construct_moe_model() (abstractmethod)
+Construct the Mixture of Experts model using the models in the model pool.
+Returns:
+WeightEnsemblingMoE – The constructed MoE model.
+Source code in fusion_bench/method/we_moe/we_moe.py
get_shuffled_test_loader_iter(task) (abstractmethod)
+Get an iterator for the shuffled test data loader for a specific task.
+Parameters:
+task (str) – The task for which to get the test data loader.
+Returns:
+DataLoader – The shuffled test data loader iterator.
+Source code in fusion_bench/method/we_moe/we_moe.py
load_checkpoint(model, checkpoint) (abstractmethod)
+on_test_time_adaptation_start()
run(modelpool)
+Run the WeightEnsemblingMoEAlgorithm to fuse models using Weight Ensembling Mixture of Experts.
+Parameters:
+modelpool (ModelPool) – The pool of models to be fused.
+Returns:
+WeightEnsemblingMoE – The fused MoE model.
+Source code in fusion_bench/method/we_moe/we_moe.py
save_checkpoint(model, checkpoint) (abstractmethod)
test_time_adaptation(module)
+Perform test-time adaptation for the given module.
+Parameters:
+module (WeightEnsemblingMoE) – The MoE module to adapt.
+Returns:
+WeightEnsemblingMoE – The adapted MoE module.
+Source code in fusion_bench/method/we_moe/we_moe.py
entropy_loss(logits)
+Compute the entropy loss of a set of logits.
+Parameters:
+logits (Tensor) – The logits to compute the entropy loss of.
+Returns:
+Tensor – The entropy loss of the logits.
+Source code in fusion_bench/method/we_moe/we_moe.py
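+Functionally, this is the mean Shannon entropy of the softmax distribution over the logits, which serves as the surrogate objective minimized during test-time adaptation. A minimal sketch of such a function (illustrative, not the library source):
+import torch
+
+def entropy_loss_sketch(logits: torch.Tensor) -> torch.Tensor:
+    # Mean entropy of softmax(logits) over the class dimension.
+    log_probs = torch.log_softmax(logits, dim=-1)
+    return -(log_probs.exp() * log_probs).sum(dim=-1).mean()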
clip_we_moe
+CLIPWeightEnsemblingMoEAlgorithm
+Bases: WeightEnsemblingMoEAlgorithm, CLIPClassificationMixin
CLIPWeightEnsemblingMoEAlgorithm is a class that implements the WeightEnsemblingMoEAlgorithm +for CLIP models. It extends the WeightEnsemblingMoEAlgorithm and CLIPClassificationMixin classes.
+Attributes:
+modelpool (CLIPVisionModelPool) – The model pool containing the CLIP models.
+Source code in fusion_bench/method/we_moe/clip_we_moe.py
compute_logits(module, batch, task)
+Compute the logits for the given batch and task.
+Parameters:
+module – The model module.
+batch – The input batch.
+task – The task name.
+Returns:
+Tensor – The computed logits.
+Source code in fusion_bench/method/we_moe/clip_we_moe.py
construct_moe_model()
+Construct the Mixture of Experts (MoE) model using the models in the model pool.
+Returns:
+WeightEnsemblingMoE – The constructed MoE model.
+Source code in fusion_bench/method/we_moe/clip_we_moe.py
get_shuffled_test_loader_iter(tta_dataset) (cached)
+Get an iterator for the shuffled test data loader.
+Parameters:
+tta_dataset (str) – The name of the test-time adaptation dataset.
+Returns:
+Iterator – An iterator for the shuffled test data loader.
+Source code in fusion_bench/method/we_moe/clip_we_moe.py
load_checkpoint(model, checkpoint)
+on_test_time_adaptation_start()
+Load the CLIP processor and construct the zero-shot classification head for each task.
+save_checkpoint(model, checkpoint)
+Anke Tang et al. "Merging Multi-Task Models via Weight-Ensembling Mixture of Experts." ICML 2024. http://arxiv.org/abs/2402.00433 ↩
+Z. Lu, C. Fan, W. Wei, X. Qu, D. Chen, and Y. Cheng, “Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging,” doi: 10.48550/arXiv.2406.15479. NeurIPS 2024. ↩
+L. Shen, A. Tang, E. Yang et al. Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging. Oct, 2024. ↩
+Weighted averaging, also known as weight-ensembling, merges models by taking a weighted mean of their parameters. In the context of fully fine-tuned models, the weights are averaged according to their respective model-wise coefficients. Concretely, if we have \(n\) models with weights \(\theta_i\) and model-wise coefficients \(w_i\), the weights of the final model \(\theta\) are computed as:
+\[ \theta = \sum_{i=1}^{n} w_i \theta_i \]
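+A minimal sketch of this computation on state dicts (illustrative; the normalization mirrors the `normalize` option in the configuration below):
+import copy
+import torch.nn as nn
+
+def weighted_average(models: list[nn.Module], weights: list[float], normalize: bool = True) -> nn.Module:
+    if normalize:  # rescale so the coefficients sum to 1
+        total = sum(weights)
+        weights = [w / total for w in weights]
+    sds = [m.state_dict() for m in models]
+    merged_sd = {k: sum(w * sd[k] for w, sd in zip(weights, sds)) for k in sds[0]}
+    merged = copy.deepcopy(models[0])
+    merged.load_state_dict(merged_sd)
+    return merged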
+Configuration template for the Weighted Averaging algorithm:
+name: weighted_average
+normalize: true # if true, the weights will be normalized before merging
+weights: # List of weights for each model
+ - 0.5
+ - 0.5
+
Use the following command to run the Weighted Averaging algorithm:
+The following command merges eight CLIP-ViT models using a weighted average approach.
Because method.normalize is set to true, the weights are normalized to sum to 1, making this equivalent to a simple average.
fusion_bench \
+ method=weighted_average \
+ method.normalize=true \
+ method.weights=[0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3] \
+ modelpool=clip-vit-base-patch32_TA8_model_only \
+ taskpool=clip-vit-classification_TA8
+
Here is an example of how to use the Weighted Averaging algorithm to merge two Llama models. In particular, Llama models of the type transformers.LlamaForCausalLM are merged using the Weighted Averaging algorithm.
fusion_bench \
+ method=weighted_average_for_llama \
+ method.merged_model_save_path=outputs/test_merged_llama_model \
+ modelpool=llama_for_causallm \
+ taskpool=dummy
+
or using the following configuration file config/llama_weighted_average.yaml
defaults:
+ - example_config
+ - override method: weighted_average_for_llama
+ - override modelpool: llama_for_causallm
+ - _self_
+
+modelpool:
+ models:
+ # the pre-trained model (base model) is optional
+ # if not provided, the first model will be used as the base model
+ - name: _pretrained_
+ path: meta-llama/Meta-Llama-3-8B
+ - name: expert_1
+ path: meta-llama/Meta-Llama-3-8B
+ - name: expert_2
+ path: meta-llama/Meta-Llama-3-8B-Instruct
+
+method:
+ normalize: true # if true, the weights will be normalized before merging
+ weights: # List of weights for each model
+ - 0.5
+ - 0.5
+  # if true, only the backbone of the model will be merged and the head will be kept from the pre-trained model (if the pre-trained model is provided; otherwise the head of the first model will be used)
+ # if false, the whole model will be merged
+ backbone_only: true
+
+ merged_model_save_path: null
+ save_tokenizer: true
+ push_to_hub: false
+
+WeightedAverageAlgorithm ¶
+
+Bases: BaseAlgorithm, SimpleProfilerMixin
+
+Source code in fusion_bench/method/weighted_average/weighted_average.py (lines 33–102)
+_config_mapping = BaseAlgorithm._config_mapping | {'normalize': 'normalize', 'weights': 'weights'} (class-attribute, instance-attribute) ¶
+
+normalize = normalize (instance-attribute) ¶
+
+verbose = verbose (instance-attribute) ¶
+
+weights = weights (instance-attribute) ¶
+__init__(normalize, weights, verbose=True, **kwargs) ¶
+
+Source code in fusion_bench/method/weighted_average/weighted_average.py
+run(modelpool) ¶
+
+Fuses the models in the model pool using a weighted average approach.
+
+Parameters:
+
+modelpool (ModelPool) – The pool of models to be fused.
+
+Raises:
+
+ValueError – If the number of weights does not match the number of models in the model pool.
+
+Returns:
+
+forward_model (torch.nn.Module) – The resulting model after fusion.
+
+Source code in fusion_bench/method/weighted_average/weighted_average.py
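+Based on the signatures documented above, here is a hedged usage sketch. The import path mirrors the WeightedEnsembleAlgorithm example later on this page and may differ between FusionBench versions, and modelpool is assumed to be a ModelPool constructed elsewhere:
+
+from fusion_bench.method import WeightedAverageAlgorithm
+
+# Signature as documented above: __init__(normalize, weights, verbose=True, **kwargs)
+algorithm = WeightedAverageAlgorithm(
+    normalize=True,      # rescale the weights to sum to 1 before merging
+    weights=[0.5, 0.5],  # one weight per model in the pool
+)
+
+# run() raises ValueError if the number of weights does not match the pool size.
+merged_model = algorithm.run(modelpool)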
+WeightedAverageForLLama ¶
+
+Bases: BaseAlgorithm
+
+A class to perform weighted averaging of Llama/Mistral models.
+
+Source code in fusion_bench/method/weighted_average/llama.py (lines 17–113)
+__init__(normalize, weights, backbone_only, merged_model_save_path, save_tokenizer, push_to_hub, **kwargs) ¶
+
+Initialize the WeightedAverageForLLama class with the given parameters.
+
+Parameters:
+
+normalize (bool) – Whether to normalize the weights.
+
+weights (List[float]) – The weights for averaging the models.
+
+backbone_only (bool) – Whether to use only the backbone of the models.
+
+merged_model_save_path (str) – The path to save the merged model.
+
+save_tokenizer (bool) – Whether to save the tokenizer.
+
+push_to_hub (bool) – Whether to push the model to the hub.
+
+Source code in fusion_bench/method/weighted_average/llama.py
+run(modelpool) ¶
+
+Executes the weighted averaging of models in the provided model pool.
+
+Parameters:
+
+modelpool (LLamaForCausalLMPool) – The pool of models to be averaged.
+
+Returns:
+
+base_model – The base model after merging the state dictionaries of the models in the pool.
+
+Raises:
+
+ValueError – If the number of weights does not match the number of models in the pool.
+
+Source code in fusion_bench/method/weighted_average/llama.py
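+Combining the documented parameters, here is a hedged sketch of invoking this class programmatically. The import path is inferred from the source file location above, and modelpool is assumed to be an already-constructed LLamaForCausalLMPool:
+
+from fusion_bench.method.weighted_average.llama import WeightedAverageForLLama
+
+algorithm = WeightedAverageForLLama(
+    normalize=True,
+    weights=[0.5, 0.5],
+    backbone_only=True,  # merge the backbone; keep the head of the pre-trained model
+    merged_model_save_path="outputs/test_merged_llama_model",
+    save_tokenizer=True,
+    push_to_hub=False,
+)
+
+# `modelpool` is assumed to be a LLamaForCausalLMPool as documented above.
+merged_model = algorithm.run(modelpool)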
A weighted ensemble is a machine learning technique that combines the predictions of multiple models to produce a final prediction. The idea is to leverage the strengths of each individual model to improve overall performance and robustness.
+Formally, a weighted ensemble can be defined as follows:
+Given a set of \(n\) models, each model \(f_i\) produces a prediction \(f_i(x)\) for an input \(x\). Each model \(i\) also has an associated weight \(w_i\). The final prediction \(F(x)\) of the weighted ensemble is a weighted sum of the individual model predictions:
+
+\[ F(x) = \sum_{i=1}^{n} w_i f_i(x) \]
+The weights \(w_i\) are typically non-negative and sum to 1 (i.e., \(\sum_{i=1}^n w_i = 1\)), which ensures that the final prediction is a convex combination of the individual model predictions.
+
+The weights can be determined in various ways: they could be set based on the performance of the models on a validation set, or learned as part of the training process. In some cases, all models might be given equal weight.
+
+The goal of a weighted ensemble is to produce a final prediction that is more accurate or robust than any individual model. This is particularly useful when the individual models have complementary strengths and weaknesses.
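+Before turning to the FusionBench API, the following from-scratch sketch expresses the weighted sum defined above as a PyTorch module; it is illustrative only and not the library's WeightedEnsembleModule:
+
+from typing import List
+
+import torch
+from torch import nn
+
+
+class SimpleWeightedEnsemble(nn.Module):
+    """Compute F(x) = sum_i w_i * f_i(x) over a fixed set of models."""
+
+    def __init__(self, models: List[nn.Module], weights: List[float]):
+        super().__init__()
+        if len(models) != len(weights):
+            raise ValueError("One weight is required per model.")
+        self.models = nn.ModuleList(models)
+        # Normalize so the output is a convex combination of the predictions.
+        total = sum(weights)
+        self.weights = [w / total for w in weights]
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return sum(w * model(x) for w, model in zip(self.weights, self.models))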
+The following Python code snippet demonstrates how to use the WeightedEnsembleAlgorithm
class from the fusion_bench.method
module to create a weighted ensemble of PyTorch models.
from omegaconf import DictConfig
+from fusion_bench.method import WeightedEnsembleAlgorithm
+
+# Instantiate the algorithm
+method_config = {'name': 'weighted_ensemble', 'weights': [0.3, 0.7]}
+algorithm = WeightedEnsembleAlgorithm(DictConfig(method_config))
+
+# Assume we have a list of PyTorch models (nn.Module instances) that we want to ensemble.
+models = [...]
+
+# Run the algorithm on the models.
+merged_model = algorithm.run(models)
+
Here's a step-by-step explanation:
+
+1. Instantiate the WeightedEnsembleAlgorithm:
+    - A dictionary method_config is created with two keys: 'name' and 'weights'. The 'name' key is set to 'weighted_ensemble', indicating the type of ensemble method to use. The 'weights' key is set to a list of weights [0.3, 0.7], indicating the weight assigned to each model in the ensemble.
+    - The method_config dictionary is converted to a DictConfig object, which is the configuration object used by the omegaconf library.
+    - The WeightedEnsembleAlgorithm is then instantiated with the DictConfig object as an argument.
+2. Assume a list of PyTorch models that you want to ensemble. This list is assigned to the variable models. The actual models are not shown in this code snippet.
+3. Run the algorithm on the models: the run method of the WeightedEnsembleAlgorithm instance is called with the models list as an argument. The result is a merged model that represents the weighted ensemble of the input models, assigned to the variable merged_model.
Here we list the options for the weighted ensemble algorithm:
+
+| Option | Default | Description |
+| --- | --- | --- |
+| weights | null | A list of floats representing the weights for each model in the ensemble. |
+| normalize | True | Whether to normalize the weights so that they sum to 1. |
If normalize is set to True, the weights will be normalized so that they sum to 1. Mathematically, this means that each weight \(w_i\) is divided by the sum of all weights, so that
+
+\[ w_i \leftarrow \frac{w_i}{\sum_{j=1}^{n} w_j} \]
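+In plain Python, this normalization step amounts to:
+
+weights = [0.3, 0.7, 1.0]
+total = sum(weights)
+weights = [w / total for w in weights]  # [0.15, 0.35, 0.5], sums to 1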
Configuration template for the weighted ensemble algorithm:
+name: weighted_ensemble
+
+# this should be a list of floats, one for each model in the ensemble
+# If weights is null, the ensemble will use the default weights, which are equal weights for all models.
+weights: null
+normalize: true
+
Construct a weighted ensemble using our CLI tool fusion_bench
:
fusion_bench method=weighted_ensemble \
+  method.weights=[0.3,0.7] \
+ modelpool=... \
+ taskpool=...
+
+WeightedEnsembleAlgorithm ¶
+
+Bases: BaseAlgorithm
+
+Source code in fusion_bench/method/ensemble.py
+run(modelpool) ¶
+
+Run the weighted ensemble algorithm on the given model pool.
+
+Parameters:
+
+modelpool (BaseModelPool | List[Module]) – The pool of models to ensemble.
+
+Returns:
+
+WeightedEnsembleModule – The weighted ensembled model.
+
+Source code in fusion_bench/method/ensemble.py