Template Selection and Custom MSA Generation

For advanced users, customizing MSA generation and template selection can significantly improve prediction quality for challenging targets. Learn when and how to take control of these critical inputs.

#Understanding Structural Templates

AlphaFold2 can use known structures as templates during prediction. While the model doesn't strictly require templates, they can guide predictions for:

Proteins with close homologs in the PDB
Complex folds with limited MSA coverage
Regions where MSA quality is poor

When Templates Help

Template Benefits

High-quality templates (>40% identity) improve accuracy
Structural constraints guide fold topology
Can improve speed of convergence

When Templates Can Hurt

Template Pitfalls

Low-quality templates (<25% identity) may bias predictions
Templates from different conformational states
Missing or poorly resolved template regions
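The identity bands quoted above (>40% helpful, <25% risky) can be checked before a template is used. The following is a minimal sketch, assuming the target and template have already been aligned pairwise and are supplied as equal-length gapped strings; the function names and the "borderline" label for the 25-40% range are illustrative choices, not part of AlphaFold2.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over columns where both sequences have a residue."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def classify_template(identity: float) -> str:
    """Map identity to the bands used in this section (hypothetical labels)."""
    if identity > 40.0:
        return "high-quality"
    if identity < 25.0:
        return "risky"
    return "borderline"


if __name__ == "__main__":
    ident = percent_identity("ACDE-FG", "ACDQKFG")
    print(f"{ident:.1f}% identity -> {classify_template(ident)}")
```

Identity alone does not detect the other pitfalls listed above, such as a template in a different conformational state, so a screen like this is only a first filter.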

#Custom MSA Generation

The multiple sequence alignment (MSA) is AlphaFold2's most important input. Customizing MSA generation is crucial for:

Novel proteins with few homologs
Proteins from understudied organisms
Engineered or synthetic sequences
Proteins where standard search fails

MSA Database Selection

Different databases offer different trade-offs:

UniRef90: Good balance of diversity and speed
UniRef100: More sequences, but more redundancy
MGnify: Metagenomic sequences, excellent for microbial proteins
BFD: Comprehensive but computationally expensive

Optimizing Search Parameters

bash

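# Example: iterative search for difficult targets (tune -N and --incE)
# -A saves the alignment of hits in Stockholm format; stdout carries only the search report
jackhmmer -N 5 --incE 0.001 --cpu 8 -A output.sto \
  query.fasta uniref90.fasta > jackhmmer.log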

# For very remote homologs
hhblits -i query.fasta -d uniclust30 \
  -n 3 -e 0.001 -maxfilt 100000 \
  -o output.hhr -oa3m output.a3m

Parameter Tips

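-N / -n: Number of search iterations in jackhmmer / HHblits (increase for remote homologs)
--incE / -e: E-value inclusion threshold in jackhmmer / HHblits (lower = more stringent)
-maxfilt: Maximum number of HHblits hits allowed through prefiltering (increase for better coverage)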
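When several parameter settings need to be compared, it can help to build the command lines programmatically. The sketch below only assembles argument lists from the flags discussed above and does not run anything. The helper names are hypothetical, and the jackhmmer builder uses `-A` so that the alignment is written to a file.

```python
import shlex


def jackhmmer_command(query, database, out_sto,
                      iterations=5, inc_evalue=0.001, cpus=8):
    """Build a jackhmmer argument list; -A saves the hit alignment (Stockholm)."""
    return [
        "jackhmmer",
        "-N", str(iterations),
        "--incE", str(inc_evalue),
        "--cpu", str(cpus),
        "-A", out_sto,
        query, database,
    ]


def hhblits_command(query, database, out_a3m,
                    iterations=3, evalue=0.001, maxfilt=100000):
    """Build an hhblits argument list; -oa3m writes the MSA in A3M format."""
    return [
        "hhblits",
        "-i", query,
        "-d", database,
        "-n", str(iterations),
        "-e", str(evalue),
        "-maxfilt", str(maxfilt),
        "-oa3m", out_a3m,
    ]


if __name__ == "__main__":
    # Print a small sweep over iteration counts; file names are placeholders.
    for n in (3, 5, 8):
        cmd = jackhmmer_command("query.fasta", "uniref90.fasta",
                                f"iter{n}.sto", iterations=n)
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools and databases are installed.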

#Strategies for Difficult Targets

Low MSA Coverage

When you have <30 sequences in your MSA:

Expand search: Use more databases (add MGnify for metagenomes)
Relax thresholds: Increase E-value cutoff cautiously
Try profile-based search: Use HHblits (HMM-HMM comparison) instead of a plain sequence search such as BLAST
Consider ESMFold: Predicts from a single sequence and doesn't require an MSA
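A quick depth check can decide whether these steps are needed. This is a minimal sketch that counts the header lines of a FASTA-like MSA (A3M included) and applies the thresholds used in this guide: fewer than 30 sequences is low, and more than 100 is the target. The tier names are illustrative.

```python
def count_msa_sequences(msa_text: str) -> int:
    """Count records in FASTA-like text by counting '>' header lines."""
    return sum(1 for line in msa_text.splitlines() if line.startswith(">"))


def triage_depth(n_sequences: int) -> str:
    """Classify MSA depth using this guide's thresholds (<30 low, >100 target)."""
    if n_sequences < 30:
        return "low"        # expand databases, relax E-value, try HHblits/ESMFold
    if n_sequences <= 100:
        return "moderate"   # usable, but more depth is worth trying for
    return "adequate"


if __name__ == "__main__":
    example = ">query\nACDEFG\n>hit1\nACD-FG\n>hit2\nA-DEFG\n"
    n = count_msa_sequences(example)
    print(n, triage_depth(n))
```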

Domain-Based Prediction

For multi-domain proteins with variable MSA quality:

Identify domain boundaries using Pfam or InterPro
Generate separate MSAs for each domain
Predict domains independently
Combine predictions or predict full-length with domain-specific MSAs

Why This Works

Different domains may have different evolutionary histories. Domain-specific MSAs capture this better than full-length alignments.
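One way to obtain a domain-specific MSA without rerunning the search is to cut the domain's columns out of the full-length alignment and discard rows that barely cover them. The sketch below assumes the MSA is held as a list of `(name, aligned_sequence)` pairs of equal length with the query first; the function name and the 50% coverage cutoff are assumptions for illustration.

```python
def slice_msa(rows, start, end, min_coverage=0.5):
    """Cut columns start..end (1-based, inclusive) out of an aligned MSA.

    rows: list of (name, aligned_sequence); the first row is the query.
    Rows whose residues cover less than min_coverage of the domain are
    dropped; the query row is always kept.
    """
    width = end - start + 1
    kept = []
    for index, (name, seq) in enumerate(rows):
        segment = seq[start - 1:end]
        covered = sum(1 for c in segment if c != "-") / width
        if index == 0 or covered >= min_coverage:
            kept.append((name, segment))
    return kept


if __name__ == "__main__":
    msa = [
        ("query", "ACDEFGHIKL"),
        ("hit1", "ACDEF-----"),   # covers only the first domain
        ("hit2", "-----GHIKL"),   # covers only the second domain
        ("hit3", "AC--FGH--L"),   # partial coverage of both
    ]
    for name, segment in slice_msa(msa, 1, 5):
        print(name, segment)
```

Running a fresh search with each domain sequence as the query, as the steps above describe, may recover homologs that the full-length search missed; slicing is only the cheaper variant.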

#MSA Pairing for Multimers

For protein complexes, MSA pairing helps identify co-evolving residues:

Paired MSA Generation

Search for sequences where both chains appear in the same genome
Use genomic context to infer interactions
Maintain pairing information in MSA format

bash

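# Paired MSA search for complex A+B (chainA.fasta and chainB.fasta are placeholder names)
# Search both sequences against the same database so hits can be matched by organism
jackhmmer -N 3 --cpu 8 -A chainA.sto chainA.fasta uniref90.fasta > chainA.log
jackhmmer -N 3 --cpu 8 -A chainB.sto chainB.fasta uniref90.fasta > chainB.log
# Keep the genome/organism annotation of every hit; it is the key used for pairing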
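The pairing step itself can be sketched in a few lines. This example assumes each chain's hits have already been reduced to `(species_id, aligned_sequence)` tuples sorted best hit first, and it pairs the top hit per species from each chain. The data layout and function name are assumptions for illustration, and production pipelines use more elaborate pairing rules.

```python
def pair_by_species(hits_a, hits_b):
    """Pair the best hit per species from two per-chain hit lists.

    hits_a, hits_b: lists of (species_id, aligned_sequence), best hit first.
    Returns a list of (species_id, seq_a, seq_b) in the order of hits_a.
    """
    best_b = {}
    for species, seq in hits_b:
        best_b.setdefault(species, seq)  # keep only the first (best) hit

    paired = []
    seen = set()
    for species, seq in hits_a:
        if species in best_b and species not in seen:
            seen.add(species)
            paired.append((species, seq, best_b[species]))
    return paired


if __name__ == "__main__":
    chain_a = [("9606", "ACD"), ("10090", "ACE"), ("9606", "AAD"), ("562", "TCD")]
    chain_b = [("10090", "KLM"), ("9606", "KLV"), ("4932", "KKK")]
    for species, seq_a, seq_b in pair_by_species(chain_a, chain_b):
        print(species, seq_a + seq_b)  # concatenated row of the paired MSA
```

Species that appear for only one chain are left unpaired here; whether to keep them as unpaired rows is a separate design decision.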

#Template-Based Prediction Mode

Providing Custom Templates

When you have specific structural insights:

Homology models from SWISS-MODEL or MODELLER
Experimental structures of close homologs
Previously predicted structures you want to use as a starting point

Template Quality

Prefer templates with >30% sequence identity to the target. Lower-identity templates may bias predictions toward an incorrect fold.

Template-Free Mode

For truly novel folds or when you want unbiased predictions:

Disable template search entirely
Rely purely on MSA and learned patterns
Better for proteins with no close structural homologs

#MSA Quality Control

Assessing MSA Quality

Before running prediction, check:

Depth: Number of sequences (aim for >100)
Coverage: Position-wise coverage across sequence
Diversity: Sequence identity distribution
Gaps: Gap patterns in alignment

python

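# Example: analyzing MSA quality
from Bio import SeqIO

# A3M stores insertions relative to the query as lowercase letters, so rows
# differ in length until those are removed. AlignIO.read(..., "fasta")
# rejects such files, so parse the records and strip lowercase first.
rows = [
    "".join(c for c in str(record.seq) if not c.islower())
    for record in SeqIO.parse("msa.a3m", "fasta")
]

nseqs = len(rows)
length = len(rows[0])  # the first row is the query

# Calculate coverage per position (fraction of rows with a residue)
coverage = []
for i in range(length):
    gaps = sum(1 for row in rows if row[i] == '-')
    coverage.append(1 - gaps / nseqs)

print(f"Sequences: {nseqs}")
print(f"Length: {length}")
print(f"Avg coverage: {sum(coverage)/len(coverage):.2f}")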
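The diversity item in the checklist above can be examined the same way. The sketch below bins every hit by its identity to the query; it assumes the rows are equal-length aligned strings with the query first (for A3M input, after lowercase insertions have been removed). The 20% bin width and the function names are arbitrary choices for illustration.

```python
def identity_percent(query: str, row: str) -> int:
    """Integer percent identity over columns where both rows have a residue."""
    aligned = [(q, r) for q, r in zip(query, row) if q != "-" and r != "-"]
    if not aligned:
        return 0
    matches = sum(1 for q, r in aligned if q == r)
    return 100 * matches // len(aligned)


def identity_histogram(rows):
    """Count hits per 20% identity-to-query bin; rows[0] is the query."""
    labels = ["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]
    counts = {label: 0 for label in labels}
    query = rows[0]
    for row in rows[1:]:
        bin_index = min(identity_percent(query, row) // 20, 4)
        counts[labels[bin_index]] += 1
    return counts


if __name__ == "__main__":
    msa = ["ACDEFGHIKL", "ACDEF-----", "AAAAAGHIKL", "-CDQQQQQQQ"]
    for label, count in identity_histogram(msa).items():
        print(f"{label:>8}: {count}")
```

An MSA whose hits pile up in the top bin is deep but redundant; a spread across bins suggests the alignment carries more evolutionary signal.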

#Advanced Techniques

MSA Subsampling

For very large MSAs (>10,000 sequences):

Subsample to ~5,000 diverse representatives
Use clustering (MMseqs2, CD-HIT) to maintain diversity
Keep rare sequences that add unique information
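To illustrate the idea behind diversity-preserving subsampling, here is a small greedy filter in pure Python: a row is kept only if it is less than 90% identical to every row already kept. It is a sketch for small alignments, not a replacement for MMseqs2 or CD-HIT, which scale to far larger inputs; the function names and the 90% default are assumptions.

```python
def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where both rows are ungapped."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return sum(1 for x, y in columns if x == y) / len(columns)


def greedy_subsample(rows, max_identity=0.9, max_rows=5000):
    """Keep the query plus rows that are < max_identity to all kept rows.

    rows: equal-length aligned strings, query first. Stops at max_rows.
    """
    kept = [rows[0]]
    for row in rows[1:]:
        if len(kept) >= max_rows:
            break
        if all(pairwise_identity(row, k) < max_identity for k in kept):
            kept.append(row)
    return kept


if __name__ == "__main__":
    msa = [
        "ACDEFGHIKL",  # query
        "ACDEFGHIKL",  # exact duplicate, dropped
        "ACDEFGHIKV",  # 90% identical to the query, dropped
        "AAAAAGHIKL",  # 60% identical, kept
        "TTTTTTTTTT",  # unrelated, kept
    ]
    print(greedy_subsample(msa))
```

Because rows are visited in order, sorting the MSA so that the most informative hits come first determines which representative of each near-duplicate group survives.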

Synthetic MSA Augmentation

Experimental

Some researchers augment MSAs with:

Sequences from protein language models (ESM, ProtTrans)
Guided evolution trajectories
De novo designed sequences with similar fold

This is experimental and not yet standard practice.

#Case Studies

Case 1: Orphan Protein

Problem: Novel protein with only 12 sequences in standard MSA search.

Solution:

Added MGnify metagenome database → 145 sequences
Used ESMFold as orthogonal prediction method
Combined confidence from both methods
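The case study does not say how the two confidence estimates were combined. One simple possibility, shown here purely as an assumption, is to label each residue by whether both, one, or neither method reports a per-residue confidence above a cutoff. The sketch assumes both methods provide pLDDT on a 0-100 scale and uses 70 as an illustrative cutoff.

```python
def consensus_confidence(plddt_a, plddt_b, threshold=70.0):
    """Label each residue by how many methods are confident about it.

    plddt_a, plddt_b: per-residue confidence lists on a 0-100 scale
    (for example one from AlphaFold2 and one from ESMFold).
    """
    if len(plddt_a) != len(plddt_b):
        raise ValueError("confidence lists must cover the same residues")
    labels = []
    for a, b in zip(plddt_a, plddt_b):
        if a >= threshold and b >= threshold:
            labels.append("both")
        elif a >= threshold or b >= threshold:
            labels.append("one")
        else:
            labels.append("neither")
    return labels


if __name__ == "__main__":
    af2 = [90.0, 80.0, 40.0, 30.0]      # made-up values for illustration
    esmfold = [85.0, 50.0, 75.0, 20.0]  # made-up values for illustration
    print(consensus_confidence(af2, esmfold))
```

Agreement in confidence does not guarantee that the two predicted structures agree, so a structural superposition of the models is still worth doing.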

Case 2: Engineered Fusion Protein

Problem: Artificial fusion with no natural homologs.