Template Selection and Custom MSA Generation

Advanced guide to structural templates and MSA customization: learn when templates help or hurt, and how to optimize MSA searches for difficult targets.

Protogen Team

Computational Biologists

February 4, 2025

For advanced users, customizing MSA generation and template selection can significantly improve prediction quality for challenging targets. Learn when and how to take control of these critical inputs.

#Understanding Structural Templates

AlphaFold2 can use known structures as templates during prediction. While the model doesn't strictly require templates, they can guide predictions for:

  • Proteins with close homologs in PDB
  • Complex folds with limited MSA coverage
  • Regions where MSA quality is poor

When Templates Help

Template Benefits

  • High-quality templates (>40% identity) improve accuracy
  • Structural constraints guide fold topology
  • Can improve speed of convergence

When Templates Can Hurt

Template Pitfalls

  • Low-quality templates (<25% identity) may bias predictions
  • Templates from a different conformational state than the one you want to model
  • Templates with missing or poorly resolved regions
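
The identity thresholds in the two lists above can be checked before a template is used. The following is a minimal sketch, assuming you already have a pairwise alignment of the query and template as two equal-length gapped strings. The function names are hypothetical, and computing identity over columns where both sequences have a residue is one common convention, not the only one.

```python
def percent_identity(query_aln: str, template_aln: str) -> float:
    """Percent identity over columns where both sequences have a residue."""
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (q, t)
        for q, t in zip(query_aln.upper(), template_aln.upper())
        if q != "-" and t != "-"
    ]
    if not pairs:
        return 0.0
    return 100.0 * sum(q == t for q, t in pairs) / len(pairs)


def template_advice(identity: float) -> str:
    """Apply this guide's rules of thumb (>40% helps, <25% may bias)."""
    if identity > 40:
        return "likely helpful"
    if identity < 25:
        return "may bias the prediction"
    return "borderline: compare runs with and without the template"


if __name__ == "__main__":
    # Toy pairwise alignment, made up for illustration
    ident = percent_identity("MKT-AYIAKQR", "MKTLAYLGKQR")
    print(f"{ident:.1f}% identity: {template_advice(ident)}")
```

Identity alone says nothing about conformational state or unresolved regions, so those still need a manual look at the template structure.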

#Custom MSA Generation

The multiple sequence alignment (MSA) is AlphaFold2's most important input. Customizing MSA generation is crucial for:

  • Novel proteins with few homologs
  • Proteins from understudied organisms
  • Engineered or synthetic sequences
  • Proteins where standard search fails

MSA Database Selection

Different databases offer different trade-offs:

  • UniRef90: Good balance of diversity and speed
  • UniRef100: More sequences, but more redundancy
  • MGnify: Metagenomic sequences, excellent for microbial proteins
  • BFD: Comprehensive but computationally expensive

Optimizing Search Parameters

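bash
# Example: More sensitive search for difficult targets
# -A saves the hit alignment in Stockholm format; stdout only carries the search report
jackhmmer -N 5 --incE 0.001 --cpu 8 \
  -A output.sto -o jackhmmer.log \
  query.fasta uniref90.fasta

# For very remote homologs
# -d takes the database prefix; -oa3m writes the MSA alongside the .hhr report
hhblits -i query.fasta -d uniclust30 \
  -n 3 -e 0.001 -maxfilt 100000 \
  -o output.hhr -oa3m output.a3m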

Parameter Tips

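  • -N (jackhmmer) / -n (HHblits): Number of search iterations (increase for remote homologs)
  • --incE (jackhmmer) / -e (HHblits): E-value threshold for including hits in the alignment (lower = more stringent)
  • -maxfilt (HHblits): Maximum number of hits allowed to pass the prefilter (increase for better coverage)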
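
When tuning these parameters it can help to step from strict to permissive settings and stop once the MSA is deep enough. The sketch below only builds the jackhmmer argument lists. The ladder values, file names, and function name are illustrative assumptions, not recommended defaults.

```python
import shlex

# Hypothetical escalation ladder: (iterations, inclusion E-value),
# ordered from strict to permissive.
SEARCH_LADDER = [(1, 1e-4), (3, 1e-3), (5, 1e-3)]


def jackhmmer_command(query, database, out_sto, iterations, inc_evalue, cpus=8):
    """Build a jackhmmer argument list that saves the alignment with -A."""
    return [
        "jackhmmer",
        "-N", str(iterations),      # number of search iterations
        "--incE", str(inc_evalue),  # inclusion E-value threshold
        "--cpu", str(cpus),
        "-A", out_sto,              # Stockholm alignment of included hits
        query,
        database,
    ]


if __name__ == "__main__":
    for n_iter, evalue in SEARCH_LADDER:
        cmd = jackhmmer_command(
            "query.fasta", "uniref90.fasta", f"iter{n_iter}.sto", n_iter, evalue
        )
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once HMMER is installed and the database path is real.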

#Strategies for Difficult Targets

Low MSA Coverage

When you have <30 sequences in your MSA:

  • Expand search: Use more databases (add MGnify for metagenomes)
  • Relax thresholds: Increase E-value cutoff cautiously
  • Try profile-based search: Use HHblits instead of BLAST
  • Consider ESMFold: Doesn't require an MSA
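
Expanding the search to several databases usually produces overlapping hits, so the results need to be merged and deduplicated. A minimal sketch, assuming each search has already been converted to (name, row) tuples whose rows are aligned to the query and have the query's length. The function name and input layout are hypothetical.

```python
def merge_msas(query, *hit_lists):
    """Merge aligned hits from several searches, dropping duplicate sequences.

    Rows are compared after removing gaps, so the same sequence found in
    two databases is kept only once. The query always stays in the first row.
    """
    merged = [("query", query)]
    seen = {query.replace("-", "").upper()}
    for hits in hit_lists:
        for name, row in hits:
            if len(row) != len(query):
                raise ValueError(f"{name}: row length differs from query length")
            key = row.replace("-", "").upper()
            if key in seen:
                continue  # already present from an earlier database
            seen.add(key)
            merged.append((name, row))
    return merged


if __name__ == "__main__":
    # Toy hits, made up for illustration
    uniref_hits = [("u1", "ACDE"), ("u2", "AC-E")]
    mgnify_hits = [("m1", "ACDE"), ("m2", "GCDE")]
    for name, row in merge_msas("ACDE", uniref_hits, mgnify_hits):
        print(name, row)
```

Counting the rows of the merged result gives a quick check against the 30-sequence mark mentioned above.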

Domain-Based Prediction

For multi-domain proteins with variable MSA quality:

  • Identify domain boundaries using PFAM or InterPro
  • Generate separate MSAs for each domain
  • Predict domains independently
  • Combine predictions or predict full-length with domain-specific MSAs

Why This Works

Different domains may have different evolutionary histories. Domain-specific MSAs capture this better than full-length alignments.
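
Splitting the query into per-domain FASTA records is the first practical step of this workflow. The sketch below assumes domain boundaries were read off a PFAM or InterPro annotation as 1-based inclusive positions. The sequence, the boundaries, and the function names are made up for illustration.

```python
def split_domains(sequence, boundaries):
    """Return {domain_name: subsequence} for 1-based inclusive (start, end) pairs."""
    domains = {}
    for name, (start, end) in boundaries.items():
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"{name}: boundaries {start}-{end} fall outside the sequence")
        domains[name] = sequence[start - 1:end]
    return domains


def to_fasta(domains, prefix="query"):
    """Format the domains as FASTA text, one record per domain."""
    records = [f">{prefix}_{name}\n{seq}" for name, seq in domains.items()]
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    # Toy sequence and boundaries, for illustration only
    seq = "MKTAYIAKQRQISFVKSHFSRQ"
    parts = split_domains(seq, {"N_domain": (1, 10), "C_domain": (13, 22)})
    print(to_fasta(parts), end="")
```

Each record can then be searched on its own, which keeps a well-covered domain from masking a poorly covered one.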

#MSA Pairing for Multimers

For protein complexes, MSA pairing helps identify co-evolving residues:

Paired MSA Generation

  • Search for sequences where both chains appear in the same genome
  • Use genomic context to infer interactions
  • Maintain pairing information in MSA format
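bash
# Paired MSA search for complex A+B
# Search both sequences against the same database
# (chainA.fasta and chainB.fasta are placeholder file names)
jackhmmer -N 3 --cpu 8 -A chainA.sto -o chainA.log chainA.fasta uniref90.fasta
jackhmmer -N 3 --cpu 8 -A chainB.sto -o chainB.log chainB.fasta uniref90.fasta
# Keep genome association metadata: UniRef hit descriptions carry TaxID=, which can be used to pair rows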
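
Once both chains have been searched, rows can be paired by organism. The following is a simplified sketch, assuming hits are available as (header, aligned_row) tuples and that headers contain a UniRef-style `TaxID=` field. It keeps only the first hit per taxon, which is a simplification: production pipelines typically rank candidates within each species before pairing. Function names and example headers are hypothetical.

```python
import re

TAXID_PATTERN = re.compile(r"TaxID=(\d+)")


def first_hit_per_taxon(hits):
    """Map taxon ID -> aligned row, keeping the first hit seen for each taxon."""
    best = {}
    for header, row in hits:
        match = TAXID_PATTERN.search(header)
        if match and match.group(1) not in best:
            best[match.group(1)] = row
    return best


def pair_msas(hits_a, hits_b):
    """Concatenate chain A and chain B rows that come from the same taxon."""
    rows_a = first_hit_per_taxon(hits_a)
    rows_b = first_hit_per_taxon(hits_b)
    return [(tax, rows_a[tax] + rows_b[tax]) for tax in rows_a if tax in rows_b]


if __name__ == "__main__":
    # Made-up headers and rows, for illustration only
    hits_a = [("hitA1 TaxID=562", "MKTA"), ("hitA2 TaxID=9606", "MRTA")]
    hits_b = [("hitB1 TaxID=9606", "GSHL"), ("hitB2 TaxID=10090", "GSHM")]
    for taxon, paired_row in pair_msas(hits_a, hits_b):
        print(taxon, paired_row)
```

Taxa found for only one chain are dropped from the paired block. They can still be used as unpaired rows if the prediction tool accepts them.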

#Template-Based Prediction Mode

Providing Custom Templates

When you have specific structural insights:

  • Homology models from Swiss-Model or Modeller
  • Experimental structures of close homologs
  • Previously predicted structures you want to use as a starting point

Template Quality

Only use templates with >30% sequence identity to the query. Templates with lower identity can bias the prediction toward an incorrect fold.

Template-Free Mode

For truly novel folds or when you want unbiased predictions:

  • Disable template search entirely
  • Rely purely on MSA and learned patterns
  • Better for proteins with no close structural homologs

#MSA Quality Control

Assessing MSA Quality

Before running prediction, check:

  • Depth: Number of sequences (aim for >100)
  • Coverage: Position-wise coverage across sequence
  • Diversity: Sequence identity distribution
  • Gaps: Gap patterns in alignment
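python
# Example: Analyzing MSA quality
from Bio import SeqIO

# A3M stores insertions relative to the query as lowercase letters, so rows
# differ in length and cannot be read directly as an alignment. Dropping the
# lowercase letters gives every row the query's length.
rows = [
    "".join(c for c in str(record.seq) if not c.islower())
    for record in SeqIO.parse("msa.a3m", "fasta")
]

nseqs = len(rows)
length = len(rows[0])  # the first row is the query

# Calculate coverage per position (fraction of rows with a residue, not a gap)
coverage = []
for i in range(length):
    gaps = sum(1 for row in rows if row[i] == "-")
    coverage.append(1 - gaps / nseqs)

print(f"Sequences: {nseqs}")
print(f"Length: {length}")
print(f"Avg coverage: {sum(coverage)/len(coverage):.2f}")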
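
The diversity check from the list above can be approximated by binning each row's identity to the query. This is a minimal sketch, assuming the rows are equal-length strings with the query first, for example the lowercase-stripped rows of an A3M file. The function names and the 10% bin width are arbitrary choices.

```python
from collections import Counter


def identity_to_query(query_row, row):
    """Percent identity over columns where both rows have a residue."""
    pairs = [(q, r) for q, r in zip(query_row, row) if q != "-" and r != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(q == r for q, r in pairs) / len(pairs)


def identity_histogram(rows, bin_width=10):
    """Count non-query rows per identity bin, keyed by the bin's lower edge."""
    query = rows[0]
    counts = Counter()
    for row in rows[1:]:
        ident = identity_to_query(query, row)
        # Clamp 100% into the top bin instead of opening a separate one
        lower_edge = min(int(ident // bin_width) * bin_width, 100 - bin_width)
        counts[lower_edge] += 1
    return counts


if __name__ == "__main__":
    # Toy alignment, made up for illustration
    toy_rows = ["ACDEFGHIKL", "ACDEFGHIKV", "AC--FGHMKL", "MCDEYGHIRL"]
    for edge, count in sorted(identity_histogram(toy_rows).items()):
        print(f"{edge}-{edge + 10}%: {count}")
```

A histogram concentrated in the top bins suggests a deep but redundant MSA, while a broad spread indicates the diversity that co-evolution signals depend on.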

#Advanced Techniques

MSA Subsampling

For very large MSAs (>10,000 sequences):

  • Subsample to ~5,000 diverse representatives
  • Use clustering (MMseqs2, CD-HIT) to maintain diversity
  • Keep rare sequences that add unique information
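
The idea behind diversity-preserving subsampling can be illustrated with a greedy identity filter. This pure-Python sketch is a stand-in for the clustering tools named above, not a reproduction of their algorithms, and its pairwise comparisons are far too slow for MSAs of this size. It assumes equal-length aligned rows with the query first. The function names and default thresholds are illustrative.

```python
def pairwise_identity(row_a, row_b):
    """Fractional identity over columns where both rows have a residue."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def greedy_subsample(rows, max_identity=0.9, max_rows=5000):
    """Keep a row only if it is not too similar to any row already kept."""
    kept = [rows[0]]  # always keep the query
    for row in rows[1:]:
        if len(kept) >= max_rows:
            break
        if all(pairwise_identity(row, other) <= max_identity for other in kept):
            kept.append(row)
    return kept


if __name__ == "__main__":
    # Toy alignment, made up for illustration
    toy_rows = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV", "MMMMMGHIKL"]
    print(greedy_subsample(toy_rows, max_identity=0.8))
```

Because near-duplicates are discarded first, divergent rows survive the filter, which matches the goal of keeping rare sequences that add unique information.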

Synthetic MSA Augmentation

Experimental

Some researchers augment MSAs with:
  • Sequences from protein language models (ESM, ProtTrans)
  • Guided evolution trajectories
  • De novo designed sequences with similar fold
This is experimental and not yet standard practice.

#Case Studies

Case 1: Orphan Protein

Problem: Novel protein with only 12 sequences in standard MSA search.

Solution:

  • Added MGnify metagenome database → 145 sequences
  • Used ESMFold as orthogonal prediction method
  • Combined confidence from both methods

Case 2: Engineered Fusion Protein

Problem: Artificial fusion with no natural homologs.

Solution:

  • Generated separate MSAs for each domain
  • Concatenated MSAs with proper gap handling
  • Accounted for linker flexibility when interpreting the prediction
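
One way to read "proper gap handling" in Case 2 is a block-diagonal layout: each domain's homologs are padded with gaps across the linker and the other domain, and only the query row spans the full fusion. The sketch below assumes that interpretation. The function name, toy sequences, and linker are hypothetical.

```python
def concatenate_domain_msas(query_a, rows_a, linker, query_b, rows_b):
    """Build a full-length MSA from two domain MSAs in block-diagonal form.

    rows_a and rows_b hold homolog rows already aligned to query_a and
    query_b. The linker has no homologs, so only the query covers it.
    """
    for row in rows_a:
        if len(row) != len(query_a):
            raise ValueError("domain A row length differs from query_a")
    for row in rows_b:
        if len(row) != len(query_b):
            raise ValueError("domain B row length differs from query_b")

    pad_after_a = "-" * (len(linker) + len(query_b))
    pad_before_b = "-" * (len(query_a) + len(linker))

    msa = [query_a + linker + query_b]             # full-length query row
    msa += [row + pad_after_a for row in rows_a]   # domain A homologs
    msa += [pad_before_b + row for row in rows_b]  # domain B homologs
    return msa


if __name__ == "__main__":
    # Toy domains and linker, made up for illustration
    for row in concatenate_domain_msas("ACDE", ["ACDF"], "GGS", "KLMN", ["KLMQ", "K-MN"]):
        print(row)
```

Every row ends up with the length of the fusion construct, so the result can be written out as a single alignment.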

#Tools and Resources

  • HH-suite: Profile-profile search (HHblits, HHpred)
  • Jackhmmer: Iterative sequence search (HMMER package)
  • MMseqs2: Fast clustering and search
  • Clustal Omega: MSA generation and refinement

#Best Practices Summary

Advanced MSA Checklist

  • ✓ Assess standard MSA quality before customization
  • ✓ Try multiple database combinations for low-coverage targets
  • ✓ Use templates only when >30% identity is available
  • ✓ Consider domain-based prediction for multi-domain proteins
  • ✓ Validate custom MSAs before expensive predictions
  • ✓ Document all customizations for reproducibility