Vision-Language Adapters: Parameter-Efficient Multimodal Fine-tuning
Exploring LoRA, adapters, and other parameter-efficient methods for fine-tuning large vision-language models.
As vision-language models grow to billions of parameters, full fine-tuning becomes computationally prohibitive. Parameter-efficient fine-tuning (PEFT) methods like LoRA enable adaptation with under 1% of the model's parameters while retaining 95%+ of full fine-tuning performance.
Interactive Adapter Explorer
[Interactive explorer: adapter configuration (LoRA and other methods), adapter placement across the vision-language architecture, performance metrics, the LoRA low-rank decomposition (rank = 16), and a method comparison.]
Implementation Examples
LoRA for LLaVA
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "v_proj",
        "k_proj", "o_proj"
    ],
    lora_dropout=0.1,
)
model = get_peft_model(model, config)
# Trainable: ~19M params (cf. the results table below)
Adapter Placement
# Vision-language specific module groups
vision_modules = [
    "vision_model.*.self_attn",
    "vision_model.*.mlp"
]
language_modules = [
    "language_model.*.self_attn",
    "language_model.*.mlp"
]

# Apply based on strategy
if strategy == "vision-only":
    target = vision_modules
elif strategy == "language-only":
    target = language_modules
else:
    target = vision_modules + language_modules
Best Practices for Multimodal Adapters
Adapter Selection
- LoRA: Best for large models with limited GPU memory
- Adapters: Good for multi-task learning
- Prefix tuning: Ideal for prompt-based tasks
- BitFit: Extremely lightweight; good for quick experiments
Optimization Tips
- Start with rank 16 for LoRA and adjust based on the task
- Vision layers often need less adaptation than language layers
- Combine methods for complex multimodal tasks
- Monitor gradient norms to detect under- or over-parameterization (see the sketch after this list)
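The last tip is easy to operationalize. A minimal sketch (the helper name and the interpretation are illustrative assumptions) that collects per-adapter gradient norms after the backward pass:

import torch

def adapter_grad_norms(model, adapter_keyword="lora"):
    # Collect gradient norms of adapter parameters after loss.backward()
    norms = {}
    for name, param in model.named_parameters():
        if adapter_keyword in name and param.grad is not None:
            norms[name] = param.grad.norm().item()
    return norms

# In the training loop, after loss.backward():
# norms = adapter_grad_norms(model)
# Persistently tiny norms suggest the adapter is over-parameterized for the
# task (try a lower rank); persistently large norms suggest it is
# under-parameterized or the learning rate is too high.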
Why Adapters for Multimodal?
The Challenge
Fine-tuning a 7B parameter LLaVA model requires:
- Memory: 28GB+ for model weights alone
- Compute: Multiple A100 GPUs for reasonable batch sizes
- Storage: Separate copy for each downstream task
- Time: Hours to days of training
The Solution
Adapters reduce requirements by 10-100x:
- Memory: Add only ~10-100MB to the base model (see the estimate after this list)
- Compute: Train on single consumer GPU
- Storage: Share frozen base, swap small adapters
- Time: Minutes to hours of training
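To make the memory and storage figures concrete, here is a back-of-the-envelope count (a sketch with illustrative dimensions for a 7B-class model, not a specific checkpoint): a LoRA update on a d_in × d_out weight adds r·(d_in + d_out) parameters.

# Rough LoRA size estimate (illustrative dimensions)
hidden = 4096       # hidden size of the language model
num_layers = 32     # transformer blocks
rank = 16

def lora_params(rank, d_in, d_out):
    return rank * (d_in + d_out)

# q_proj and v_proj in every layer, each hidden x hidden
total = num_layers * 2 * lora_params(rank, hidden, hidden)
print(f"{total / 1e6:.1f}M trainable parameters")        # ~8.4M
print(f"~{total * 2 / 1e6:.0f}MB adapter file in fp16")  # ~17MB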
Core Adapter Methods
1. LoRA (Low-Rank Adaptation)
Decomposes the weight update into a product of two low-rank matrices:

W = W₀ + ΔW = W₀ + BA

Where:
- W₀ ∈ ℝ^(d×d) = Frozen pretrained weights
- A ∈ ℝ^(r×d) = Down-projection matrix
- B ∈ ℝ^(d×r) = Up-projection matrix
- r ≪ d = Low rank (typically 1-64)
class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank=16, alpha=32):
        super().__init__()
        self.lora_A = nn.Parameter(torch.randn(rank, in_dim))
        self.lora_B = nn.Parameter(torch.zeros(out_dim, rank))
        self.scaling = alpha / rank

    def forward(self, x, frozen_weight):
        # Frozen path
        out = F.linear(x, frozen_weight)
        # LoRA path
        out += (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return out
Advantages:
- No inference latency (merge at deployment)
- Minimal memory overhead
- Works with any linear layer
For Multimodal:
- Apply to attention layers in both vision and language
- Different ranks for different modalities
- Can merge multiple LoRAs for multi-task
2. Bottleneck Adapters
Insert small bottleneck layers between transformer blocks:
class Adapter(nn.Module):
    def __init__(self, dim, bottleneck_dim=64):
        super().__init__()
        self.down_project = nn.Linear(dim, bottleneck_dim)
        self.up_project = nn.Linear(bottleneck_dim, dim)
        self.nonlinearity = nn.ReLU()

    def forward(self, x):
        residual = x
        x = self.down_project(x)
        x = self.nonlinearity(x)
        x = self.up_project(x)
        return x + residual  # Skip connection
Advantages:
- Modular design
- Easy to add/remove
- Good for multi-task learning
For Multimodal:
- Separate adapters for vision/language paths
- Cross-modal adapters for interaction layers
- Task-specific adapter banks
3. Prefix Tuning
Prepend learnable tokens to input sequences:
class PrefixTuning(nn.Module):
    def __init__(self, num_tokens=20, dim=768, num_layers=12):
        super().__init__()
        self.num_tokens = num_tokens
        self.dim = dim
        self.num_layers = num_layers
        self.prefix_tokens = nn.Parameter(torch.randn(num_tokens, dim))
        self.prefix_mlp = nn.Sequential(
            nn.Linear(dim, dim * 2),
            nn.Tanh(),
            nn.Linear(dim * 2, num_layers * 2 * dim)
        )

    def forward(self, batch_size):
        # (num_tokens, num_layers * 2 * dim) -> (num_tokens, 2, num_layers, dim)
        prefix = self.prefix_mlp(self.prefix_tokens)
        prefix = prefix.view(self.num_tokens, 2, self.num_layers, self.dim)
        # Add a batch dimension and expand for the batch
        return prefix.unsqueeze(0).expand(batch_size, -1, -1, -1, -1)
Advantages:
- No architectural changes
- Works with black-box models
- Interpretable as soft prompts
For Multimodal:
- Separate prefixes for image and text
- Cross-modal prefix tokens
- Task-specific prefix banks
4. BitFit
Only tune bias terms:
def apply_bitfit(model):
    # Freeze all parameters
    for param in model.parameters():
        param.requires_grad = False
    # Unfreeze only bias terms
    for name, param in model.named_parameters():
        if 'bias' in name:
            param.requires_grad = True
    return model
Advantages:
- Extremely lightweight (~0.1% params)
- Fast training
- Surprisingly effective
For Multimodal:
- Quick baseline for new tasks
- Combine with other methods (see the sketch after this list)
- Good for few-shot scenarios
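One concrete way to combine BitFit with another method is PEFT's `bias` option, which unfreezes bias terms alongside the LoRA matrices. A minimal sketch, assuming a vision-language model already loaded as in the other examples; the rank and target modules are illustrative:

from peft import LoraConfig, get_peft_model

# LoRA + BitFit-style bias tuning in one config:
# bias="all" trains every bias term, "lora_only" restricts it to
# biases inside the adapted modules.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    bias="all",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()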
Placement Strategies
Where to Add Adapters?
Different placement strategies yield different trade-offs:
| Strategy | Vision | Language | Performance | Parameters |
|---|---|---|---|---|
| All Layers | ✅ | ✅ | 95% | 100% |
| Top Layers | ❌ | ✅ (top half) | 88% | 50% |
| Bottom Layers | ✅ (bottom half) | ❌ | 82% | 50% |
| Sparse (every 2nd) | ✅ | ✅ | 90% | 50% |
| Vision Only | ✅ | ❌ | 78% | 33% |
| Language Only | ❌ | ✅ | 85% | 67% |
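To show how a row of this table becomes configuration, the sketch below builds a target-module pattern for the "Sparse (every 2nd)" strategy. The module paths and layer counts are assumptions and vary by checkpoint, so verify them against model.named_modules():

# Hypothetical sketch: adapt attention projections in every 2nd layer
# of both the vision and language towers.
num_vision_layers, num_language_layers = 24, 32

vision_ids = "|".join(str(i) for i in range(0, num_vision_layers, 2))
language_ids = "|".join(str(i) for i in range(0, num_language_layers, 2))

sparse_pattern = (
    rf".*vision_model\.encoder\.layers\.({vision_ids})\.self_attn\.(q_proj|v_proj)"
    rf"|.*language_model\.model\.layers\.({language_ids})\.self_attn\.(q_proj|v_proj)"
)
# Pass as LoraConfig(target_modules=sparse_pattern, ...); PEFT treats a
# string target_modules as a regex over module names.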
Multimodal-Specific Considerations
def get_adapter_config(task_type):
    if task_type == "visual_grounding":
        # Focus on vision encoder
        return {
            "vision_layers": [0, 1, 2, 3, 4, 5],
            "language_layers": [10, 11],  # Only top layers
            "rank": 32
        }
    elif task_type == "image_captioning":
        # Focus on language generator
        return {
            "vision_layers": [4, 5],  # Only top layers
            "language_layers": list(range(12)),  # All layers
            "rank": 16
        }
    elif task_type == "vqa":
        # Balance both modalities
        return {
            "vision_layers": list(range(0, 6, 2)),  # Every other
            "language_layers": list(range(0, 12, 2)),
            "rank": 24
        }
Advanced Techniques
1. Orthogonal Adaptation
Ensure adapters learn complementary features:
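A minimal sketch of one way to do this (an assumed formulation): penalize overlap between the low-rank subspaces of different task adapters, using the A matrices from the LoRALayer above.

import torch

def orthogonality_penalty(lora_A_task1, lora_A_task2):
    # Both inputs are (rank, in_dim) parameters as in LoRALayer above.
    # The penalty is the squared Frobenius norm of their cross-Gram matrix,
    # which is zero when the two row spaces are mutually orthogonal.
    cross = lora_A_task1 @ lora_A_task2.T   # (rank1, rank2)
    return (cross ** 2).sum()

# During multi-task training (illustrative weight):
# loss = task_loss + 0.01 * orthogonality_penalty(adapter_vqa.lora_A,
#                                                 adapter_caption.lora_A)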
2. Adaptive Rank Selection
Dynamic rank based on layer importance:
def adaptive_rank(layer_idx, total_layers):
    # Higher rank for middle layers
    if layer_idx < total_layers * 0.25:
        return 8
    elif layer_idx < total_layers * 0.75:
        return 32
    else:
        return 16
3. Multi-Task Adapters
Share and compose adapters for multiple tasks:
class MultiTaskAdapter(nn.Module):
    def __init__(self, tasks, dim, rank=16):
        super().__init__()
        # Shared component (a LoRA-style layer that applies only the
        # low-rank update, without the frozen path)
        self.shared = LoRALayer(dim, dim, rank // 2)
        # Task-specific components
        self.task_specific = nn.ModuleDict({
            task: LoRALayer(dim, dim, rank // 2)
            for task in tasks
        })

    def forward(self, x, task):
        x = self.shared(x)
        x = self.task_specific[task](x)
        return x
4. Cross-Modal Adapters
Special adapters for vision-language interaction:
class CrossModalAdapter(nn.Module):
    def __init__(self, vision_dim, text_dim, rank=16):
        super().__init__()
        self.vision_to_text = LoRALayer(vision_dim, text_dim, rank)
        self.text_to_vision = LoRALayer(text_dim, vision_dim, rank)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, vision_features, text_features):
        # Bidirectional adaptation
        v2t = self.vision_to_text(vision_features)
        t2v = self.text_to_vision(text_features)
        # Gated combination
        gate = torch.sigmoid(self.gate)
        vision_adapted = vision_features + gate * t2v
        text_adapted = text_features + (1 - gate) * v2t
        return vision_adapted, text_adapted
Performance Analysis
Empirical Results on LLaVA
| Method | Params | VQAv2 | GQA | TextVQA | Memory | Time |
|---|---|---|---|---|---|---|
| Full Fine-tuning | 7B | 79.5 | 63.3 | 58.2 | 28GB | 24h |
| LoRA (r=16) | 19M | 78.8 | 62.7 | 57.1 | 8GB | 3h |
| LoRA (r=32) | 38M | 79.2 | 63.0 | 57.8 | 9GB | 4h |
| Adapters | 45M | 78.5 | 62.4 | 56.9 | 9GB | 4h |
| Prefix | 10M | 77.2 | 61.8 | 55.4 | 7GB | 2h |
| BitFit | 7M | 75.1 | 60.2 | 53.8 | 7GB | 1h |
Scaling Laws for Adapters
Adapter performance follows a predictable pattern: the gap to full fine-tuning performance shrinks roughly as α · r^(−β) as the rank grows (a numeric illustration follows the definitions).
Where:
- r = adapter rank
- α ≈ 0.2 = task-dependent constant
- β ≈ 0.5 = scaling exponent
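Assuming the relative performance takes the form 1 − α·r^(−β) (the functional form is an assumption; the constants are the ones above), the predicted fraction of full fine-tuning performance at a few ranks:

# Illustrative only: assumes relative performance = 1 - alpha * r**(-beta)
alpha, beta = 0.2, 0.5

for r in (4, 16, 64):
    rel = 1 - alpha * r ** (-beta)
    print(f"rank {r:>2}: ~{rel:.1%} of full fine-tuning performance")
# rank  4: ~90.0%
# rank 16: ~95.0%
# rank 64: ~97.5%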
Implementation Guide
Setting Up LoRA for LLaVA
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load base model
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    load_in_8bit=True  # Quantization for memory
)

# Configure LoRA. When target_modules is a string, PEFT treats it as a
# regex over module names; exact paths vary by checkpoint, so check
# model.named_modules() and adjust the pattern as needed.
peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=(
        # Vision encoder and language model attention projections.
        # LLaVA-1.5 has no cross-attention; add those projections here
        # for architectures that do.
        r".*(vision_model|language_model).*self_attn\.(q_proj|v_proj)"
    ),
)

# Apply LoRA
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# trainable params: 37,748,736 || all params: 7,064,493,056 || trainable%: 0.534
Training Best Practices
# Differential learning rates per adapter group
optimizer = torch.optim.AdamW([
    {'params': vision_adapters.parameters(), 'lr': 1e-4},
    {'params': language_adapters.parameters(), 'lr': 2e-4},
    {'params': cross_modal_adapters.parameters(), 'lr': 5e-5}
])

# Gradual unfreezing
def gradual_unfreeze(model, epoch):
    if epoch == 0:
        # Only adapters
        freeze_base_model(model)
    elif epoch == 5:
        # Unfreeze top layers
        unfreeze_top_layers(model, n=2)
    elif epoch == 10:
        # Unfreeze all
        unfreeze_all(model)
Deployment Strategies
1. Merge and Deploy
# Merge LoRA weights for zero-overhead inference
def merge_lora_weights(model):
    for name, module in model.named_modules():
        if hasattr(module, 'lora_A'):
            # W' = W + BA
            module.weight.data += (
                module.lora_B @ module.lora_A
            ) * module.scaling
    return model
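When the adapters were added through PEFT, merge_and_unload() performs the same fold-in and removes the adapter modules. A quick sanity check that merging leaves outputs unchanged (a sketch; `inputs` is a hypothetical pre-processed batch):

import torch

model.eval()
with torch.no_grad():
    before = model(**inputs).logits        # LoRA applied on the fly
    merged = model.merge_and_unload()      # PEFT: fold LoRA into base weights
    after = merged(**inputs).logits

# Small numerical drift is expected in low precision
assert torch.allclose(before, after, atol=1e-4), "merged model diverges"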
2. Dynamic Adapter Loading
class AdapterBank:
    def __init__(self, base_model):
        self.base_model = base_model
        self.adapters = {}

    def load_adapter(self, task, path):
        adapter = torch.load(path)
        self.adapters[task] = adapter

    def inference(self, inputs, task):
        # Dynamically apply the task adapter
        self.base_model.load_adapter(self.adapters[task])
        return self.base_model(inputs)
3. Multi-Task Serving
# Serve multiple tasks with a single base model
tasks = ["vqa", "captioning", "grounding"]
adapters = {task: load_adapter(f"{task}.pt") for task in tasks}

def handle_request(image, text, task):
    model.set_adapter(adapters[task])
    return model.generate(image, text)
Future Directions
Research Frontiers
- Mixture of Adapters: Route to specialized adapters
- Neural Architecture Search: Automated adapter placement
- Continual Learning: Sequential task adaptation without forgetting
- Cross-lingual Adapters: Multilingual vision-language models
Emerging Techniques
- QLoRA: Quantized 4-bit base model + LoRA (see the sketch after this list)
- DoRA: Weight decomposition for better adaptation
- AdaLoRA: Adaptive rank allocation
- VeRA: Vector-based random matrix adaptation
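Of these, QLoRA is already practical with standard tooling. A minimal sketch combining a 4-bit quantized base with LoRA; the model name is reused from the earlier example and the hyperparameters are illustrative:

import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=bnb_config,
)
# Prepares the quantized model for training (gradient checkpointing,
# casting norm layers, enabling input gradients)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"]))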
Related Concepts
- Scaling Laws - How adapters affect scaling
- Alignment Problem - Adapters for alignment
- Modality Gap - Cross-modal adapter strategies
References
- Hu et al. "LoRA: Low-Rank Adaptation of Large Language Models"
- Houlsby et al. "Parameter-Efficient Transfer Learning for NLP"
- Liu et al. "LLaVA: Large Language and Vision Assistant"
- Zaken et al. "BitFit: Simple Parameter-efficient Fine-tuning"
- Li & Liang "Prefix-Tuning: Optimizing Continuous Prompts"