Geometric Model Merging: Efficient Adaptation of Large Language Models for Speech

Mathis Escriva
January 15, 2024
15 min read

Geometric model merging represents a paradigm shift in adapting large language models for specialized tasks. Instead of fine-tuning entire models—computationally expensive and prone to catastrophic forgetting—geometric merging combines model weights from multiple specialized models using geometric operations. This approach enables creating domain-specific models without retraining, dramatically reducing computational requirements while maintaining or even improving performance.

The mathematical foundation of geometric model merging lies in the observation that fine-tuned model weights often lie on low-dimensional manifolds in the weight space. When multiple models are fine-tuned from the same base model on different tasks or datasets, their weights cluster in geometrically meaningful regions. By interpolating or averaging weights in these regions, we can create models that combine capabilities from multiple specialized models.

Weight averaging is the simplest geometric merging technique. Given N models with weights {θ₁, θ₂, ..., θ_N}, we compute merged weights as θ_merged = Σᵢ(wᵢ × θᵢ), where the coefficients wᵢ sum to 1. The uniform choice wᵢ = 1/N gives a simple average that often works surprisingly well when models are fine-tuned on related tasks. However, direct averaging assumes linearity in weight space, which doesn't always hold. We've found that learned weighted averaging, where the coefficients are optimized on validation data, outperforms the uniform average by 1-2 percentage points in accuracy.
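A minimal sketch of weight averaging over PyTorch state dicts, assuming all source models share the same architecture and parameter names; the `merge_weights` helper and its defaults are illustrative, not a specific library API.

```python
def merge_weights(state_dicts, coeffs=None):
    """Merge N state dicts as a convex combination of their parameters."""
    n = len(state_dicts)
    if coeffs is None:
        coeffs = [1.0 / n] * n  # uniform (simple) average
    assert abs(sum(coeffs) - 1.0) < 1e-6, "coefficients must sum to 1"

    merged = {}
    for key in state_dicts[0]:
        # Cast to float so integer buffers don't break the weighted sum.
        merged[key] = sum(w * sd[key].float() for w, sd in zip(coeffs, state_dicts))
    return merged

# Usage (checkpoint names are placeholders):
# merged_sd = merge_weights([sd_general, sd_medical, sd_legal])
# model.load_state_dict(merged_sd)
```

The learned-coefficient variant simply treats `coeffs` as values tuned on held-out validation data rather than fixed at 1/N.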

More sophisticated approaches use task arithmetic, which exploits the observation that fine-tuning updates compose approximately linearly in weight space. Task vectors are computed as differences between fine-tuned and base model weights: τ = θ_finetuned - θ_base. These vectors can be combined linearly to create models for new tasks: θ_new = θ_base + Σᵢ(αᵢ × τᵢ). By learning the combination coefficients αᵢ on validation data, we can create models that perform well on target tasks without any additional training.
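The same idea can be sketched in code; `task_vector` and `apply_task_vectors` below are hypothetical helpers, and the coefficients are stand-ins for values that would be tuned on validation data.

```python
def task_vector(finetuned_sd, base_sd):
    """tau = theta_finetuned - theta_base, computed per parameter tensor."""
    return {k: finetuned_sd[k].float() - base_sd[k].float() for k in base_sd}

def apply_task_vectors(base_sd, task_vectors, alphas):
    """theta_new = theta_base + sum_i alpha_i * tau_i."""
    new_sd = {k: v.float().clone() for k, v in base_sd.items()}
    for alpha, tau in zip(alphas, task_vectors):
        for k in new_sd:
            new_sd[k] += alpha * tau[k]
    return new_sd

# Usage (checkpoints and coefficients are placeholders):
# tau_med = task_vector(sd_medical, sd_base)
# tau_leg = task_vector(sd_legal, sd_base)
# merged_sd = apply_task_vectors(sd_base, [tau_med, tau_leg], alphas=[0.6, 0.4])
```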

Fisher information matrix (FIM) weighted merging incorporates uncertainty estimates into the merging process. Models with lower uncertainty (higher Fisher information) for specific parameters receive higher weights during merging. This approach is particularly valuable when merging models trained on datasets of varying sizes or quality. FIM-weighted merging has shown 2-3% accuracy improvements over uniform averaging in our experiments.
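A sketch of how FIM-weighted merging might be implemented. The diagonal Fisher approximation via averaged squared gradients, the Hugging Face-style `.loss` attribute on the model output, and the epsilon smoothing term are assumptions of this illustration rather than a fixed recipe.

```python
import torch

def estimate_diag_fisher(model, data_loader, device="cpu"):
    """Approximate the diagonal Fisher as the mean squared gradient per parameter."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for batch in data_loader:
        model.zero_grad()
        loss = model(**{k: v.to(device) for k, v in batch.items()}).loss
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(data_loader), 1) for n, f in fisher.items()}

def fisher_weighted_merge(state_dicts, fishers, eps=1e-8):
    """Per-parameter average weighted by Fisher information.

    Keys without a Fisher estimate (e.g. buffers) fall back to a uniform average.
    """
    n = len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        if key in fishers[0]:
            num = sum(f[key] * sd[key].float() for f, sd in zip(fishers, state_dicts))
            den = sum(f[key] for f in fishers) + eps
            merged[key] = num / den
        else:
            merged[key] = sum(sd[key].float() for sd in state_dicts) / n
    return merged
```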

RegMean (Regularized Mean) addresses the non-linearity issue by regularizing the merging process. Instead of directly averaging weights, RegMean minimizes the distance between the merged model's predictions and each individual model's predictions on a validation set. This optimization-based approach finds weights that better account for non-linear interactions between models. In practice, RegMean comes closer to the accuracy of a model trained from scratch than simpler merging schemes, at a computational cost that is orders of magnitude lower.
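One common way to instantiate this idea solves the objective per linear layer in closed form, W* = (Σᵢ XᵢᵀXᵢ)⁻¹ Σᵢ XᵢᵀXᵢ Wᵢ, using Gram matrices of each model's layer inputs collected on validation data. The sketch below assumes those Gram matrices have already been gathered, and the function and variable names are illustrative.

```python
import torch

def regmean_merge_linear(weights, grams, ridge=1e-4):
    """Merge one linear layer in closed form.

    weights: list of [out_features, in_features] tensors, one per source model.
    grams:   list of [in_features, in_features] Gram matrices X_i^T X_i,
             computed from each model's own validation inputs to this layer.
    """
    in_features = weights[0].shape[1]
    gram_sum = sum(grams)                                   # sum_i G_i
    target_sum = sum(g @ w.t() for g, w in zip(grams, weights))  # sum_i G_i W_i^T
    # A small ridge term keeps the linear system well conditioned.
    reg = ridge * torch.eye(in_features)
    merged_t = torch.linalg.solve(gram_sum + reg, target_sum)    # W*^T
    return merged_t.t()  # back to [out_features, in_features]
```

Non-linear-layer parameters (embeddings, norms, biases) are typically handled with plain averaging in this kind of scheme.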

For speech recognition applications, we merge models fine-tuned on different domains: general conversation, technical presentations, medical terminology, legal proceedings. The merged model inherits capabilities from all source domains, achieving performance comparable to domain-specific models across all domains. This is particularly valuable for enterprises operating in multiple industries or dealing with diverse audio content.
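As a purely illustrative example, the hypothetical task-arithmetic helpers sketched earlier could combine the four domain checkpoints like this (checkpoint names and coefficients are placeholders, with the coefficients tuned on held-out audio from each domain):

```python
# Assumes task_vector / apply_task_vectors from the sketch above.
domain_sds = [sd_conversation, sd_presentations, sd_medical, sd_legal]
taus = [task_vector(sd, sd_base) for sd in domain_sds]
alphas = [0.3, 0.2, 0.3, 0.2]  # placeholder values, tuned per deployment
merged_sd = apply_task_vectors(sd_base, taus, alphas)
```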

Computational efficiency is the primary advantage of geometric merging. Training a new model from scratch requires weeks of GPU time and hundreds of thousands of dollars in compute costs. Geometric merging, in contrast, requires hours of computation and can be performed on standard hardware. This democratizes access to high-performance domain-specific models, enabling smaller organizations to leverage state-of-the-art speech recognition without massive computational resources.

Limitations and considerations include the assumption that source models are compatible, i.e., fine-tuned from the same base model and architecture. Merging incompatible models, or models with vastly different training regimes, often produces poor results. Additionally, geometric merging works best when source tasks are related; merging completely unrelated tasks may not provide benefits. Careful validation on target tasks is essential before deploying merged models in production.