The Modality Gap

Understanding the fundamental separation between visual and textual representations in multimodal models.

The modality gap is a fundamental phenomenon in multimodal models where visual and textual representations occupy distinct regions of the embedding space, even after extensive training to align them.

Interactive Gap Visualization

[Interactive figure: a 2D view of the shared embedding space with separate "Vision Space" and "Language Space" clusters, plus a panel of per-dimension feature distributions (density vs. feature dimension) for each modality. For the selected modalities it reports the gap distance, vision spread, text spread, and overlap percentage.]

What is the Modality Gap?

Definition

The modality gap refers to the systematic separation between embeddings from different modalities in the shared representation space:

Gap(V, T) = ‖μ_V − μ_T‖₂

Where:

  • μ_V = Mean of the vision embeddings
  • μ_T = Mean of the text embeddings
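
As a minimal, self-contained sketch of this definition, the gap can be computed directly from two batches of unit-normalized embeddings. The tensors below are random stand-ins with an artificial offset, not outputs of a real model:

```python
import torch
import torch.nn.functional as F

# Random stand-ins for encoder outputs: 512-d embeddings for 1,000 images
# and 1,000 captions; the ±0.5 offsets create a synthetic gap for illustration.
V = F.normalize(torch.randn(1000, 512) + 0.5, dim=-1)  # "vision" embeddings
T = F.normalize(torch.randn(1000, 512) - 0.5, dim=-1)  # "text" embeddings

mu_v, mu_t = V.mean(dim=0), T.mean(dim=0)
gap = torch.norm(mu_v - mu_t, p=2)  # ‖μ_V − μ_T‖₂
print(f"Gap(V, T) = {gap.item():.3f}")
```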

Key Characteristics

  1. Persistent: Remains even after training on billions of image-text pairs
  2. Universal: Observed across a wide range of vision-language models
  3. Scale-invariant: Does not noticeably shrink with model size
  4. Task-dependent: Gap magnitude varies by application

Empirical Evidence

Measuring the Gap

```python
import torch
import torch.nn.functional as F

def measure_modality_gap(model, dataloader):
    vision_embeddings = []
    text_embeddings = []

    with torch.no_grad():
        for images, texts in dataloader:
            # Embed each modality with its encoder
            v_emb = model.encode_image(images)
            t_emb = model.encode_text(texts)
            vision_embeddings.append(v_emb)
            text_embeddings.append(t_emb)

    # Concatenate all embeddings
    V = torch.cat(vision_embeddings)
    T = torch.cat(text_embeddings)

    # Compute statistics
    vision_mean = V.mean(dim=0)
    text_mean = T.mean(dim=0)
    gap_distance = torch.norm(vision_mean - text_mean)

    return {
        'gap_distance': gap_distance.item(),
        'vision_std': V.std().item(),
        'text_std': T.std().item(),
        'cosine_sim': F.cosine_similarity(vision_mean, text_mean, dim=0).item(),
    }
```

Typical Gap Measurements

| Model     | Gap Distance | Vision Spread | Text Spread | Overlap |
|-----------|--------------|---------------|-------------|---------|
| CLIP-B/32 | 0.42         | 0.18          | 0.25        | 23%     |
| CLIP-L/14 | 0.38         | 0.16          | 0.23        | 27%     |
| ALIGN     | 0.45         | 0.20          | 0.28        | 21%     |
| OpenCLIP  | 0.40         | 0.17          | 0.24        | 25%     |

Causes of the Gap

1. Information Density Differences

Vision and language encode information differently:

```python
import numpy as np
from sklearn.decomposition import PCA

# Information content analysis
def analyze_information_density(embeddings):
    # Fit a PCA to estimate the intrinsic dimension
    pca = PCA(n_components=min(100, len(embeddings)))
    pca.fit(embeddings)

    # Cumulative explained variance
    cumsum = np.cumsum(pca.explained_variance_ratio_)

    # Effective dimension (90% of variance explained)
    eff_dim = np.argmax(cumsum >= 0.9) + 1

    return {
        'effective_dimension': eff_dim,
        'entropy': -np.sum(pca.explained_variance_ratio_
                           * np.log(pca.explained_variance_ratio_ + 1e-10)),
    }
```

Findings:

  • Vision: ~60-80 effective dimensions
  • Text: ~40-50 effective dimensions
  • Vision has higher information entropy
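
A hedged usage sketch of `analyze_information_density`, comparing the two modalities on hypothetical embedding matrices of shape (N, D); the figures above come from real vision-language embeddings, not from this toy data:

```python
import numpy as np

# Hypothetical stand-ins: one row per sample, one column per embedding dimension
vision_emb = np.random.randn(5000, 512)
text_emb = np.random.randn(5000, 512)

print("vision:", analyze_information_density(vision_emb))
print("text:  ", analyze_information_density(text_emb))
```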

2. Pre-training Bias

Different pre-training objectives create distinct representations:

| Modality   | Pre-training                  | Inductive Bias      |
|------------|-------------------------------|---------------------|
| Vision     | Classification/Reconstruction | Spatial features    |
| Text       | Language Modeling             | Sequential patterns |
| Gap result | Different manifolds           | Misaligned spaces   |

3. Architectural Constraints

```python
class VisionEncoder(nn.Module):
    # Processes 2D spatial information
    def forward(self, images):
        # CNN or ViT: spatial relationships
        features = self.backbone(images)
        # Global pooling loses spatial info
        return self.pool(features)


class TextEncoder(nn.Module):
    # Processes 1D sequential information
    def forward(self, text):
        # Transformer: sequential dependencies
        features = self.transformer(text)
        # CLS token or mean pooling
        return features[:, 0]  # Different aggregation
```

4. Training Dynamics

The contrastive loss doesn't fully close the gap:

ℒ_CLIP = −log [ exp(sim(v_i, t_i)/τ) / Σ_j exp(sim(v_i, t_j)/τ) ]

This loss:

  • Ensures correct pairs have high similarity
  • But doesn't require overlapping distributions
  • Creates "parallel" spaces rather than a single unified space
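
To make that concrete, here is a minimal sketch of a symmetric CLIP-style contrastive (InfoNCE) loss. It only constrains relative similarities within a batch, so it can reach a low value even while the two embedding clouds stay far apart:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so that dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; the diagonal holds the matched pairs
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(len(image_emb), device=image_emb.device)

    # Symmetric cross-entropy over both retrieval directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```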

Implications

1. Zero-shot Performance

The gap affects cross-modal generalization:

```python
def zero_shot_classification(image_features, text_features, gap_distance, temperature=100):
    # Normalize features
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Compute similarities
    logits = temperature * image_features @ text_features.T

    # The gap causes a systematic bias, so calibrate with a gap-aware temperature
    gap_adjusted_temp = temperature * (1 + gap_distance)
    return logits / gap_adjusted_temp
```
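
A hedged usage sketch, assuming a CLIP-style `model` exposing `encode_image`/`encode_text`, a matching `tokenizer`, and a validation loader; the prompt template and class names are placeholders:

```python
class_names = ["cat", "dog", "car"]                        # hypothetical labels
prompts = [f"a photo of a {name}" for name in class_names]

with torch.no_grad():
    text_features = model.encode_text(tokenizer(prompts))  # assumed API
    image_features = model.encode_image(images)

stats = measure_modality_gap(model, val_loader)            # defined earlier
logits = zero_shot_classification(image_features, text_features,
                                  gap_distance=stats['gap_distance'])
predictions = logits.argmax(dim=-1)
```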

2. Retrieval Asymmetry

Different performance for different directions:

| Task       | Performance         | Explanation                         |
|------------|---------------------|-------------------------------------|
| Image→Text | 85% R@1             | Images map to the text region       |
| Text→Image | 78% R@1             | Text doesn't reach the image region |
| Solution   | Separate thresholds | Account for the gap in scoring      |
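
One way to realize the separate-thresholds idea is to calibrate an acceptance threshold per retrieval direction on held-out data. This is a sketch under that assumption, not a standard library routine:

```python
import torch
import torch.nn.functional as F

def calibrate_retrieval_thresholds(image_emb, text_emb, quantile=0.05):
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    sim = img @ txt.T                    # sim[i, j] = cos(image_i, text_j)

    # Best retrieved score per query, separately for each direction;
    # the gap can shift these two distributions differently.
    i2t_best = sim.max(dim=1).values     # image queries over the text gallery
    t2i_best = sim.max(dim=0).values     # text queries over the image gallery

    # Accept a retrieval only if its score clears its direction's threshold
    i2t_threshold = torch.quantile(i2t_best, quantile)
    t2i_threshold = torch.quantile(t2i_best, quantile)
    return i2t_threshold.item(), t2i_threshold.item()
```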

3. Fine-tuning Requirements

```python
class GapAwareAdapter(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Learn modality-specific projections
        self.vision_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)
        # Learnable gap-bridging offset
        self.bridge = nn.Parameter(torch.zeros(dim))

    def forward(self, features, modality):
        if modality == 'vision':
            return self.vision_proj(features) + self.bridge
        else:
            return self.text_proj(features) - self.bridge
```
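
One plausible way to use such an adapter is on top of frozen encoders, training only the adapter with the same contrastive objective; the `model`, `train_loader`, and `clip_contrastive_loss` below are the hypothetical pieces sketched earlier:

```python
adapter = GapAwareAdapter(dim=512)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

for images, texts in train_loader:        # hypothetical paired DataLoader
    with torch.no_grad():                 # keep both encoders frozen
        v = model.encode_image(images)
        t = model.encode_text(texts)

    v = adapter(v, modality='vision')
    t = adapter(t, modality='text')

    loss = clip_contrastive_loss(v, t)    # sketch from the Training Dynamics section
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```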

Bridging Strategies

1. Temperature Scaling

Adjust temperature based on gap:

τ_effective = τ_base · (1 + α · Gap)
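
A minimal sketch of this adjustment, assuming the gap has already been measured (for example with `measure_modality_gap` above) and that α is a tunable hyperparameter:

```python
import torch.nn.functional as F

def gap_scaled_logits(image_emb, text_emb, gap_distance,
                      base_temperature=0.07, alpha=1.0):
    # Effective temperature grows with the measured gap
    tau = base_temperature * (1 + alpha * gap_distance)
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return image_emb @ text_emb.T / tau
```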

2. Modality-Specific Normalization

```python
class ModalityNorm(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.vision_norm = nn.LayerNorm(dim)
        self.text_norm = nn.LayerNorm(dim)
        self.vision_scale = nn.Parameter(torch.ones(dim))
        self.text_scale = nn.Parameter(torch.ones(dim))

    def forward(self, features, modality):
        if modality == 'vision':
            return self.vision_norm(features) * self.vision_scale
        else:
            return self.text_norm(features) * self.text_scale
```

3. Intermediate Anchors

Use synthetic points to bridge the gap:

```python
def create_anchor_points(vision_emb, text_emb, num_anchors=10):
    # Linear interpolation between the modality centroids
    anchors = []
    for alpha in np.linspace(0, 1, num_anchors):
        anchor = alpha * vision_emb.mean(0) + (1 - alpha) * text_emb.mean(0)
        anchors.append(anchor)
    return torch.stack(anchors)
```

Theoretical Understanding

Manifold Hypothesis

Vision and text lie on different manifolds:

  • Vision manifold: Smooth, continuous transformations
  • Text manifold: Discrete, compositional structure
  • Gap: Minimum distance between manifolds
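
The centroid distance used so far is a coarse proxy; under the manifold view one can also look at the minimum cross-modal distance between the two point clouds. A hedged sketch on hypothetical embedding tensors:

```python
import torch
import torch.nn.functional as F

def manifold_gap_estimates(vision_emb, text_emb):
    V = F.normalize(vision_emb, dim=-1)
    T = F.normalize(text_emb, dim=-1)

    # Centroid-to-centroid distance: the Gap(V, T) defined above
    centroid_gap = torch.norm(V.mean(0) - T.mean(0))

    # Minimum pairwise distance between the clouds, a rough proxy for the
    # minimum distance between the underlying manifolds
    min_cross_distance = torch.cdist(V, T).min()

    return centroid_gap.item(), min_cross_distance.item()
```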

Information-Theoretic View

I(V; T) < min(H(V), H(T)) - Gap

The gap represents an irreducible difference in information content between the modalities.

Practical Considerations

1. Task-Specific Adaptation

Different tasks require different gap handling:

| Task           | Strategy          | Implementation               |
|----------------|-------------------|------------------------------|
| Classification | Ignore gap        | Use relative similarities    |
| Retrieval      | Calibrate         | Modality-specific thresholds |
| Generation     | Bridge explicitly | Learn projection layer       |
| QA             | Preserve gap      | Maintain semantic separation |

2. Evaluation Metrics

Account for the gap in evaluation metrics:

```python
def gap_aware_similarity(v_emb, t_emb, gap_distance):
    # Standard cosine similarity
    cos_sim = F.cosine_similarity(v_emb, t_emb)

    # Adjust for the expected gap
    adjusted_sim = cos_sim + gap_distance / 2

    # Clip to the valid range
    return torch.clamp(adjusted_sim, -1, 1)
```

Future Research

Open Questions

  1. Can the gap be eliminated?

    • Unified architectures
    • Joint training from scratch
    • Novel objectives
  2. Is the gap beneficial?

    • Preserves modality-specific information
    • Enables specialized processing
    • May improve robustness
  3. Optimal gap size?

    • Task-dependent optimization
    • Trade-off between alignment and specialization

Emerging Approaches

  • Diffusion bridges: Continuous paths between modalities
  • Optimal transport: Minimize transport cost
  • Gromov-Wasserstein: Align without shared space
  • Hyperbolic embeddings: Natural hierarchy preservation

