For advanced users, customizing multiple sequence alignment (MSA) generation and template selection can significantly improve prediction quality for challenging targets. Learn when and how to take control of these critical inputs.
#Understanding Structural Templates
AlphaFold2 can use known structures as templates during prediction. While the model doesn't strictly require templates, they can guide predictions for:
- Proteins with close homologs in the PDB
- Complex folds with limited MSA coverage
- Regions where MSA quality is poor
When Templates Help
Template Benefits
- High-quality templates (>40% identity) improve accuracy
- Structural constraints guide fold topology
- Can improve speed of convergence
When Templates Can Hurt
Template Pitfalls
- Low-quality templates (<25% identity) may bias predictions
- Templates captured in a different conformational state from the one of interest
- Missing or poorly resolved template regions
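The identity thresholds above can be applied to a query-template alignment with a few lines of code. The following is a minimal sketch, assuming you already have a pairwise alignment as two equal-length gapped strings; the function names and the toy sequences are hypothetical, and the "intermediate" bucket is simply a label for the 25-40% range that the lists above leave open.

```python
def percent_identity(query_aln: str, template_aln: str) -> float:
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    aligned = [
        (q, t)
        for q, t in zip(query_aln.upper(), template_aln.upper())
        if q != "-" and t != "-"
    ]
    if not aligned:
        return 0.0
    matches = sum(1 for q, t in aligned if q == t)
    return 100.0 * matches / len(aligned)


def template_category(identity: float) -> str:
    """Bucket a template using the thresholds quoted in this section."""
    if identity > 40.0:
        return "high-quality"
    if identity < 25.0:
        return "low-quality"
    return "intermediate"


# Toy pairwise alignment (hypothetical sequences, for illustration only)
query = "MKT-AYIAKQR"
template = "MKTLAYLSK-R"
identity = percent_identity(query, template)
print(f"{identity:.1f}% identity -> {template_category(identity)}")
```

Identity is computed here over aligned (non-gap) columns only; other conventions, such as dividing by the query length, give lower numbers for partial templates, so state which one you use when comparing thresholds.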
#Custom MSA Generation
The Multiple Sequence Alignment is AlphaFold2's most important input. Customizing MSA generation is crucial for:
- Novel proteins with few homologs
- Proteins from understudied organisms
- Engineered or synthetic sequences
- Proteins where standard search fails
MSA Database Selection
Different databases offer different trade-offs:
- UniRef90: Good balance of diversity and speed
- UniRef100: More sequences, but more redundancy
- MGnify: Metagenomic sequences, excellent for microbial proteins
- BFD: Comprehensive but computationally expensive
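Optimizing Search Parameters
```shell
# Example: More sensitive search for difficult targets
# (-A saves the alignment in Stockholm format; -o captures the search report)
jackhmmer -N 5 --incE 0.001 --cpu 8 \
    -A output.sto -o jackhmmer.log \
    query.fasta uniref90.fasta

# For very remote homologs
# (-o holds the hit list; -oa3m saves the alignment itself)
hhblits -i query.fasta -d uniclust30 \
    -n 3 -e 0.001 -maxfilt 100000 \
    -o output.hhr -oa3m output.a3m
```
Parameter Tips
- -N (jackhmmer) / -n (HHblits): Number of iterations (increase for remote homologs)
- --incE (jackhmmer) / -e (HHblits): E-value threshold for including hits in the alignment (lower = more stringent)
- -maxfilt (HHblits): Maximum number of hits allowed through the prefilter (increase for better coverage)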
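When tuning these parameters it can help to generate the command lines programmatically so that every setting is recorded. The following is a rough sketch that only builds and prints jackhmmer commands for a small sweep of iteration counts and inclusion thresholds; it does not run them. The helper name, file names, and sweep values are hypothetical.

```python
import shlex


def jackhmmer_command(query, database, out_sto, iterations=5,
                      inc_evalue=1e-3, cpus=8, log="jackhmmer.log"):
    """Build a jackhmmer argument list.

    -A writes the final alignment (Stockholm format); -o redirects the
    human-readable search report to a log file.
    """
    return [
        "jackhmmer",
        "-N", str(iterations),
        "--incE", str(inc_evalue),
        "--cpu", str(cpus),
        "-A", out_sto,
        "-o", log,
        query,
        database,
    ]


# Hypothetical sweep: from a strict single pass to a more permissive search
for n_iter, inc_e in [(1, 1e-4), (3, 1e-3), (5, 1e-3)]:
    cmd = jackhmmer_command(
        "query.fasta", "uniref90.fasta", f"iter{n_iter}.sto", n_iter, inc_e
    )
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell or passed to `subprocess.run` as a list; keeping the printed commands alongside the resulting alignments makes the customization reproducible.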
#Strategies for Difficult Targets
Low MSA Coverage
When you have <30 sequences in your MSA:
- Expand search: Use more databases (add MGnify for metagenomes)
- Relax thresholds: Increase E-value cutoff cautiously
- Try profile-based search: Use HHblits instead of BLAST
- Consider ESMFold: Doesn't require MSA
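Expanding the search across several databases usually means merging the resulting alignments. The following is a minimal sketch, assuming each per-database alignment is already in query coordinates (one column per query residue, insertions removed) and represented as a list of (name, aligned sequence) rows with the query first. The function name and toy rows are hypothetical.

```python
def merge_alignments(*alignments):
    """Merge query-anchored alignments, keeping the first copy of each sequence.

    Each alignment is a list of (name, aligned_sequence) rows of equal width
    whose first row is the query. Duplicates are detected on the ungapped,
    upper-cased sequence.
    """
    merged, seen = [], set()
    width = None
    for rows in alignments:
        for name, seq in rows:
            if width is None:
                width = len(seq)
            elif len(seq) != width:
                raise ValueError(f"{name}: row length {len(seq)} != {width}")
            key = seq.replace("-", "").upper()
            if key in seen:
                continue  # already present from an earlier database
            seen.add(key)
            merged.append((name, seq))
    return merged


# Toy alignments standing in for UniRef90 and MGnify search results
uniref_rows = [("query", "MKTAY"), ("u1", "MRTAY"), ("u2", "MK-AY")]
mgnify_rows = [("query", "MKTAY"), ("m1", "MKSAY"), ("m2", "MRTAY")]

merged = merge_alignments(uniref_rows, mgnify_rows)
print(f"{len(merged)} unique rows:", [name for name, _ in merged])
```

Because the query is the first row of every input, it is kept once at the top of the merged alignment and its repeats are dropped like any other duplicate.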
Domain-Based Prediction
For multi-domain proteins with variable MSA quality:
- Identify domain boundaries using PFAM or InterPro
- Generate separate MSAs for each domain
- Predict domains independently
- Combine predictions or predict full-length with domain-specific MSAs
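One inexpensive way to obtain per-domain MSAs is to slice an existing full-length alignment at the domain boundaries and drop rows that barely cover the domain. The sketch below assumes a query-anchored MSA as a list of (name, aligned sequence) rows and 0-based, end-exclusive boundaries; the function name, the 50% coverage cutoff, and the toy data are hypothetical. Re-running the search with each domain sequence as its own query is the more thorough alternative.

```python
def slice_domain(rows, start, end, min_coverage=0.5):
    """Cut columns [start, end) out of a query-anchored MSA.

    rows: list of (name, aligned_sequence); the first row is the query.
    Rows with residues in fewer than min_coverage of the domain columns
    are dropped; the query row is always kept.
    """
    domain_rows = []
    for idx, (name, seq) in enumerate(rows):
        segment = seq[start:end]
        covered = sum(1 for c in segment if c != "-") / len(segment)
        if idx == 0 or covered >= min_coverage:
            domain_rows.append((name, segment))
    return domain_rows


# Toy full-length MSA: h1 covers only the first domain, h2 only the second
msa = [
    ("query", "MKTAYGSLVP"),
    ("h1", "MKSAY-----"),
    ("h2", "-----GSIVP"),
    ("h3", "MRTAYGTLVP"),
]
domains = {"N": (0, 5), "C": (5, 10)}  # hypothetical boundaries

for label, (start, end) in domains.items():
    sub = slice_domain(msa, start, end)
    print(label, [name for name, _ in sub])
```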
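Why This Works
Each domain's MSA is built from that domain's own homologs, so a well-covered domain is not held back by a poorly covered neighbor.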
#MSA Pairing for Multimers
For protein complexes, MSA pairing helps identify co-evolving residues:
Paired MSA Generation
- Search for sequences where both chains appear in the same genome
- Use genomic context to infer interactions
- Maintain pairing information in MSA format
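The pairing idea can be sketched as follows, assuming each chain's hits are available as (taxon ID, aligned sequence) tuples sorted best hit first. This is a deliberate simplification: it keeps only the top-ranked hit per taxon for each chain, whereas production pipelines use more elaborate ranking. The function name and the toy hits are hypothetical.

```python
def pair_by_taxon(hits_a, hits_b):
    """Pair chain A and chain B hits that share a taxon ID.

    hits_a, hits_b: lists of (taxon_id, aligned_sequence), best hit first.
    Only the best-ranked hit per taxon is used for each chain.
    Returns a list of (taxon_id, seq_a + seq_b) rows.
    """
    best_b = {}
    for taxon, seq in hits_b:
        best_b.setdefault(taxon, seq)  # first occurrence = best-ranked hit

    paired, used = [], set()
    for taxon, seq_a in hits_a:
        if taxon in best_b and taxon not in used:
            used.add(taxon)
            paired.append((taxon, seq_a + best_b[taxon]))
    return paired


# Toy hits for a complex A+B (taxon IDs as strings)
hits_a = [("9606", "MKTAY"), ("10090", "MKSAY"), ("9606", "MRTAY"), ("7227", "MKTAF")]
hits_b = [("10090", "GSLVP"), ("9606", "GSIVP"), ("559292", "GTLVP")]

for taxon, row in pair_by_taxon(hits_a, hits_b):
    print(taxon, row)
```

Hits whose taxon appears for only one chain are left out of the paired block; they can still be used as unpaired rows padded with gaps over the other chain.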
```shell
# Paired MSA search for complex A+B
# Search both sequences against the same database
# Keep genome association metadata
```
#Template-Based Prediction Mode
Providing Custom Templates
When you have specific structural insights:
- Homology models from Swiss-Model or Modeller
- Experimental structures of close homologs
- Previously predicted structures you want to use as a starting point
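Template Quality
The template pitfalls listed earlier apply to custom templates too: low-identity templates, templates in a different conformational state, and poorly resolved regions can all bias the prediction.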
Template-Free Mode
For truly novel folds or when you want unbiased predictions:
- Disable template search entirely
- Rely purely on MSA and learned patterns
- Better for proteins with no close structural homologs
#MSA Quality Control
Assessing MSA Quality
Before running prediction, check:
- Depth: Number of sequences (aim for >100)
- Coverage: Position-wise coverage across sequence
- Diversity: Sequence identity distribution
- Gaps: Gap patterns in alignment
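Diversity and gap content from the checklist above can be estimated with plain string operations. This is a minimal sketch, assuming the MSA is a list of equal-length aligned strings in query coordinates with the query first; the function names and toy rows are hypothetical, and identity to the query is only one of several possible diversity measures.

```python
def identity_to_query(query, row):
    """Fraction of the row's non-gap positions that match the query."""
    aligned = [(q, r) for q, r in zip(query, row) if r != "-"]
    if not aligned:
        return 0.0
    return sum(q == r for q, r in aligned) / len(aligned)


def msa_diversity_report(rows):
    """Summarize depth, identity to the query, and gap content of an MSA."""
    query, hits = rows[0], rows[1:]
    if not hits:
        return {"depth": 1, "mean_identity": None,
                "min_identity": None, "max_gap_fraction": None}
    identities = [identity_to_query(query, row) for row in hits]
    gap_fractions = [row.count("-") / len(row) for row in hits]
    return {
        "depth": len(rows),
        "mean_identity": sum(identities) / len(identities),
        "min_identity": min(identities),
        "max_gap_fraction": max(gap_fractions),
    }


# Toy MSA: two half-length fragments and one full-length homolog
rows = ["MKTAYGSLVP", "MKSAY-----", "-----GSIVP", "MRTAYGTLVP"]
report = msa_diversity_report(rows)
for key, value in report.items():
    print(f"{key}: {value}")
```

A mean identity close to 1.0 suggests the alignment is deep but redundant, while a high maximum gap fraction points to fragments that contribute little outside a short region.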
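```python
# Example: Analyzing MSA quality
import string
from Bio import SeqIO

# A3M marks insertions relative to the query as lowercase letters. Strip them
# so every row has the query's length; parsing the file directly as an
# alignment fails because the raw rows have unequal lengths.
drop_lowercase = str.maketrans("", "", string.ascii_lowercase)
rows = [
    str(record.seq).translate(drop_lowercase)
    for record in SeqIO.parse("msa.a3m", "fasta")
]
nseqs = len(rows)
length = len(rows[0])

# Calculate coverage per position (fraction of rows with a residue)
coverage = []
for i in range(length):
    gaps = sum(1 for row in rows if row[i] == "-")
    coverage.append(1 - gaps / nseqs)

print(f"Sequences: {nseqs}")
print(f"Length: {length}")
print(f"Avg coverage: {sum(coverage) / len(coverage):.2f}")
```
#Advanced Techniques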
MSA Subsampling
For very large MSAs (>10,000 sequences):
- Subsample to ~5,000 diverse representatives
- Use clustering (MMseqs2, CD-HIT) to maintain diversity
- Keep rare sequences that add unique information
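For illustration, the effect of diversity-preserving subsampling can be reproduced with a simple greedy identity filter. This sketch is not MMseqs2 or CD-HIT and scales quadratically, so it is only suitable for small alignments or for understanding the idea; the function names, thresholds, and toy rows are hypothetical. Rows are assumed to be equal-length aligned strings with the query first.

```python
def identity(a, b):
    """Fraction of identical positions among columns where both rows have a residue."""
    both = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not both:
        return 0.0
    return sum(x == y for x, y in both) / len(both)


def greedy_subsample(rows, max_identity=0.9, max_rows=5000):
    """Keep the query, then add rows that are below max_identity to every kept row."""
    kept = [rows[0]]
    for row in rows[1:]:
        if len(kept) >= max_rows:
            break
        if all(identity(row, k) < max_identity for k in kept):
            kept.append(row)
    return kept


# Toy MSA: a near-duplicate of the query, a divergent homolog (twice), a fragment
rows = [
    "MKTAYGSLVP",  # query
    "MKTAYGSLVA",  # 90% identical to the query
    "MRSAYGTLIP",  # 60% identical to the query
    "MRSAYGTLIP",  # exact duplicate of the previous row
    "MKTAY-----",  # fragment identical to the query where aligned
]
kept = greedy_subsample(rows, max_identity=0.8)
print(f"kept {len(kept)} of {len(rows)} rows")
```

Because rows are visited in order, sorting the input so that full-length or otherwise preferred sequences come first determines which member of each redundant group survives.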
Synthetic MSA Augmentation
Experimental. Synthetic sequences that may be used to augment a sparse MSA include:
- Sequences from protein language models (ESM, ProtTrans)
- Guided evolution trajectories
- De novo designed sequences with similar fold
#Case Studies
Case 1: Orphan Protein
Problem: Novel protein with only 12 sequences in standard MSA search.
Solution:
- Added MGnify metagenome database → 145 sequences
- Used ESMFold as orthogonal prediction method
- Combined confidence from both methods
Case 2: Engineered Fusion Protein
Problem: Artificial fusion with no natural homologs.
Solution:
- Generated separate MSAs for each domain
- Concatenated MSAs with proper gap handling
- Accounted for linker region flexibility when analyzing the prediction
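The concatenation step in Case 2 can be sketched as a block-diagonal layout: each domain's homologs occupy only that domain's columns and are padded with gaps everywhere else, including the linker. The code below is a minimal sketch under those assumptions, with each domain MSA given as a list of equal-length aligned strings whose first row is the domain's query; the function name, the linker, and the toy sequences are hypothetical.

```python
def concatenate_unpaired(msa_a, msa_b, linker=""):
    """Concatenate two domain MSAs for a fusion construct.

    The first row of each input is that domain's query; joined through the
    linker they form the full-length first row. Every other row is padded
    with gaps over the linker and over the domain it does not cover.
    """
    len_a, len_b = len(msa_a[0]), len(msa_b[0])
    pad_after_a = "-" * (len(linker) + len_b)
    pad_before_b = "-" * (len_a + len(linker))

    combined = [msa_a[0] + linker + msa_b[0]]
    combined += [row + pad_after_a for row in msa_a[1:]]
    combined += [pad_before_b + row for row in msa_b[1:]]
    return combined


# Toy domain MSAs joined by a short flexible linker
msa_a = ["MKTAY", "MKSAY"]
msa_b = ["GSLVP", "GSIVP", "GTLVP"]

combined = concatenate_unpaired(msa_a, msa_b, linker="GGS")
for row in combined:
    print(row)
```

Since no natural sequence spans both domains, no row other than the query carries residues on both sides of the linker, which is consistent with treating the relative domain orientation as uncertain in the analysis.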
#Tools and Resources
- HH-suite: Profile-profile search (HHblits, HHpred)
- Jackhmmer: Iterative sequence search (HMMER package)
- MMseqs2: Fast clustering and search
- Clustal Omega: MSA generation and refinement
#Best Practices Summary
Advanced MSA Checklist
- ✓ Assess standard MSA quality before customization
- ✓ Try multiple database combinations for low-coverage targets
- ✓ Use templates only when one with >30% sequence identity is available
- ✓ Consider domain-based prediction for multi-domain proteins
- ✓ Validate custom MSAs before expensive predictions
- ✓ Document all customizations for reproducibility