Mutated Peptide Generator

The Mutated Peptide Generator (MPG) tool takes a SNPEff-annotated VCF as input and generates predicted neo-peptides together with the reference (wild-type, WT) peptides they pair with.

VCF Input Files

This tool accepts VCF files as input. We list a few recommended steps to take with input VCFs before running this tool.

The last line of the VCF header should look similar to:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  {SAMPLE1}   {SAMPLE2}  ... {SAMPLEN}

The ‘SAMPLE’ columns are not required.
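
As a quick sanity check before uploading, the header layout described above can be tested from the command line. The sketch below is only an illustration: `check_vcf_header` is a hypothetical helper name, not part of this tool, and it assumes a tab-delimited VCF read from standard input.

```shell
# Hypothetical helper (not part of MPG): verify that the #CHROM line of a VCF
# read from stdin starts with the eight fixed columns, and count sample columns.
check_vcf_header() {
    awk -F '\t' '
        /^#CHROM/ {
            found = 1
            ok = ($1 == "#CHROM" && $2 == "POS" && $3 == "ID" && $4 == "REF" &&
                  $5 == "ALT" && $6 == "QUAL" && $7 == "FILTER" && $8 == "INFO")
            nsamples = (NF > 9) ? NF - 9 : 0
        }
        END {
            if (!found) { print "no #CHROM header line found"; exit 1 }
            if (!ok)    { print "unexpected column names in #CHROM line"; exit 1 }
            print "header OK, sample columns: " nsamples
        }'
}

# Demo on a two-line toy header; for a real file one might run:
#   zcat < infile.vcf.gz | check_vcf_header
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tnormal\ttumor\n' |
    check_vcf_header
```

The function returns a non-zero exit status when the header line is missing or its fixed columns are misnamed, so it can be chained in front of the later steps with `&&`.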

Recommended: Strip Unnecessary Annotations

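Stripping unnecessary annotations from the VCF is not required, but it is recommended because it helps ensure a consistent starting point. Annotations can be stripped with vcftools: running --recode without --recode-INFO-all omits the existing INFO annotations from the output. For example: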

vcftools \
--gzvcf infile.vcf.gz \
--recode \
--stdout \
| bgzip -c > infile.stripped.vcf.gz

# index the VCF
tabix infile.stripped.vcf.gz

Recommended: Decomposition and Normalization

Normalization and decomposition of the VCF are highly recommended, but not required. Decomposition splits multiallelic variants so that each variant has its own record. Normalization produces the most parsimonious representation of each allele, meaning that each variant is represented in as few nucleotides as possible.

This step should be done before annotation with SNPEff, using vt. Here is an example command:

zcat < infile.stripped.vcf.gz \
| vt decompose -s - \
| vt normalize -r $REFERENCE - \
| bgzip -c > infile.stripped.decomposed.normalized.vcf.gz

# index the VCF
tabix infile.stripped.decomposed.normalized.vcf.gz

The above steps can be combined into one, e.g.:

vcftools \
--gzvcf infile.vcf.gz \
--recode \
--stdout \
| vt decompose -s - \
| vt normalize -r $REFERENCE - \
| bgzip -c > infile.stripped.decomposed.normalized.vcf.gz

# index the VCF
tabix infile.stripped.decomposed.normalized.vcf.gz

REFERENCE should point to the genomic FASTA file used for sequence alignment, for instance Homo_sapiens.GRCh38.dna.primary_assembly.fa.
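
To make the terms decomposition and normalization concrete, here is a toy sketch of what they do to the POS, REF, and ALT columns of a single record. It is not a substitute for vt: `toy_decompose_trim` is a hypothetical name, the function leaves INFO and genotype fields untouched, and it only trims shared bases, whereas real normalization also left-aligns indels against the reference FASTA.

```shell
# Toy illustration only (use vt for real data): split multiallelic ALT values
# into one record each, then trim bases shared by REF and ALT.
toy_decompose_trim() {
    awk -F '\t' -v OFS='\t' '
        /^#/ { print; next }
        {
            pos0 = $2; ref0 = $4
            n = split($5, alts, ",")
            for (i = 1; i <= n; i++) {
                pos = pos0; ref = ref0; alt = alts[i]
                # drop shared trailing bases, keeping at least one base per allele
                while (length(ref) > 1 && length(alt) > 1 &&
                       substr(ref, length(ref)) == substr(alt, length(alt))) {
                    ref = substr(ref, 1, length(ref) - 1)
                    alt = substr(alt, 1, length(alt) - 1)
                }
                # drop shared leading bases, shifting POS to match
                while (length(ref) > 1 && length(alt) > 1 &&
                       substr(ref, 1, 1) == substr(alt, 1, 1)) {
                    ref = substr(ref, 2); alt = substr(alt, 2); pos++
                }
                $2 = pos; $4 = ref; $5 = alt
                print
            }
        }'
}

# One multiallelic record in, two single-allele records out.
printf 'chr1\t100\t.\tCTCC\tCCC,C\t.\tPASS\t.\n' | toy_decompose_trim
```

For the made-up record above, the first output record becomes REF=CT, ALT=C at position 100, and the second keeps REF=CTCC with ALT=C.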

Recommended: SNPEff annotation

Although we provide the option to run SNPEff annotation on unannotated VCFs, we recommend that users run this step themselves before uploading VCFs to the tool.

Other annotators are available, but the MPG tool will only work with SNPEff annotations.

To run SNPEff, you will need to prepare a small file containing the names of the ‘normal’ and ‘tumor’ samples exactly as they appear in the VCF file. In the command below, this is the ‘in.samples’ file. Here is the recommended command for annotation with the GRCh38.86 reference genome:

echo -e "normal_sample\ttumor_sample" > in.samples
zcat < infile.stripped.decomposed.normalized.vcf.gz \
| java -Xmx16G -jar snpEff.jar -cancer -cancerSamples in.samples GRCh38.86 \
> infile.ann.vcf
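
Because the names in ‘in.samples’ must match the VCF sample columns, it can help to compare them before running the annotation. The following is a minimal sketch under that assumption; `check_sample_names` is a hypothetical helper, and the sample names in the demo are the placeholders from the command above.

```shell
# Hypothetical helper: confirm that each name given as an argument appears as a
# sample column (column 10 onward) in the #CHROM line of a VCF read from stdin.
check_sample_names() {   # usage: check_sample_names NAME... < file.vcf
    samples=$(awk -F '\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }')
    for name in "$@"; do
        printf '%s\n' "$samples" | grep -qxF -e "$name" ||
            { echo "sample not found in VCF header: $name"; return 1; }
    done
    echo "all sample names found"
}

# Demo on a toy header; for a real file one might run:
#   zcat < infile.stripped.decomposed.normalized.vcf.gz |
#       check_sample_names normal_sample tumor_sample
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tnormal_sample\ttumor_sample\n' |
    check_sample_names normal_sample tumor_sample
```

Names are compared as whole lines, so a partial match such as `tumor` against `tumor_sample` is reported as missing.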

Parameter Selection

MPG Parameters

Peptide Length
- The number of amino acids in the peptide
Peptide 1 and 2 Mutation Position
- Position in the peptide where the mutation should be located. By default, only one peptide per variant will be created. If you wish to create an additional peptide for a given variant with the SNP at a different position, fill in the ‘Peptide 2 Mutation Position’ field.
Frameshift Overlap
- Frameshift mutations often result in a relatively long stretch of amino acids that are different from the reference. This tool will break that long stretch into overlapping peptides with an overlap length specified by this parameter.
Maximum Peptide Length
- There are several instances where creating a peptide longer than the ‘Peptide Length’ parameter may be desirable. For instance, in-frame insertions of a few amino acids might require extending the peptide to keep the C termini of the reference and mutant peptides aligned. Additionally, if a frameshift results in a mutated sequence that is only several amino acids longer than the ‘Peptide Length’, it might be desirable to create a longer peptide rather than break it up into highly overlapping peptides. Finally, when a variant is near the termini and multiple positions are selected, generating a single longer peptide rather than multiple short, highly overlapping peptides may be desirable. Allowed values range from the selected ‘Peptide Length’ plus 1 to ‘Peptide Length’ plus 10 amino acids.
Reference Genome
- Options include GRCh38 (default), GRCh37, and GRCm38/mm10. The selected genome should match the genome used to generate the VCF file.
Run SNPEff Annotation
- If checked, SNPEff will be executed against the VCF file before running through the peptide generation tool. We recommend that users take care of this step themselves before uploading the VCF, as it can be time-consuming.
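
The ‘Peptide Length’, ‘Mutation Position’, and ‘Frameshift Overlap’ parameters can be pictured with a small sketch. The two functions below are hypothetical and use a made-up 20-residue sequence; how the tool itself shifts windows near the termini or places the final frameshift peptide is not documented here, so the clamping and end-alignment rules are assumptions.

```shell
# Hypothetical sketch: cut a peptide of length LEN from SEQ so that 1-based
# protein position POS lands at 1-based peptide position MUTPOS. Assumption:
# the window is shifted when it would run past either terminus.
peptide_window() {   # usage: peptide_window SEQ POS LEN MUTPOS
    awk -v seq="$1" -v pos="$2" -v len="$3" -v mutpos="$4" 'BEGIN {
        start = pos - mutpos + 1
        if (start > length(seq) - len + 1) start = length(seq) - len + 1
        if (start < 1) start = 1
        # print the peptide and the position the mutation actually ends up at
        print substr(seq, start, len) "\t" (pos - start + 1)
    }'
}

# Hypothetical sketch: tile SEQ into peptides of length LEN whose neighbours
# share OVERLAP residues. Assumption: the last tile is aligned to the end.
tile_peptides() {   # usage: tile_peptides SEQ LEN OVERLAP
    awk -v seq="$1" -v len="$2" -v overlap="$3" 'BEGIN {
        step = len - overlap
        if (step < 1) exit 1          # overlap must be smaller than length
        n = length(seq)
        for (start = 1; start <= n; start += step) {
            if (start + len - 1 >= n) {
                start = n - len + 1
                if (start < 1) start = 1
                print substr(seq, start, len)
                break
            }
            print substr(seq, start, len)
        }
    }'
}

peptide_window ACDEFGHIKLMNPQRSTVWY 10 7 4   # mutation in the middle of the protein
peptide_window ACDEFGHIKLMNPQRSTVWY 2 7 4    # mutation near the N terminus
tile_peptides  ACDEFGHIKLMNPQRSTVWY 8 4      # 8-mers overlapping by 4
```

In the second call the requested position 4 cannot be honoured because the mutation is too close to the start of the protein, so the reported peptide position falls back to 2. This mirrors the kind of situation the ‘mutation position in peptide not desired’ warning describes.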

Results

Three tables are included in the output:

SNPs - One row per SNP.
Peptide - One row per SNP and affected transcript.
Unique Peptide - One row per SNP and unique peptide. Peptides that can be produced by multiple transcripts will be collapsed in this output and a representative transcript is selected.

Each of the outputs contains many columns, which are described in detail below.

Column Definitions

output table	column	definition	example1	example2
all	chr	chromosome	chr1	chr19
all	position	chromosomal position of mutation	49045703	6477239
all	reference nucleotide	reference nucleotide	C	A
all	mutated nucleotide	mutant nucleotide	T	AG
all	mutation effect	predicted mutation effect (e.g., missense_variant, frameshift_variant, inframe_insertion, inframe_deletion, etc.)	missense_variant	frameshift_variant
all	gene name	HGNC gene symbol	AGBL4	DENND1C
all	Ensembl gene accession	Ensembl gene identifier	ENSG00000186094	ENSG00000205744
all	Ensemble transcript accession	Ensembl transcript identifier	ENST00000416121	ENST00000381480
all	reference aa	reference amino acid	Asp	Thr
all	mutated aa	mutated amino acid	Asn	fs
all	protein position	mutation position in protein	4/298	164/801
all	variant id	an internal unique identifier assigned to each variant	11089	116919
all	mutation impact	SNPEff-predicted variant impact (LOW/MODERATE/HIGH).	MODERATE	HIGH
all	transcript biotype	a classification of the transcript type. These will include protein_coding, the different IG_ and TR_ types, as well as nonsense_mediated_decay, non_stop_decay, pseudogene, etc.	protein_coding	protein_coding
all	transcript mutation code	mutation in hgvs format (nucleotide level) with coordinates based on the transcript	c.10G>A	c.491dupC
all	protein mutation code	mutation in hgvs format (amino acid level) with coordinates based on the protein	p.Asp4Asn	p.Thr165fs
all	cdna position	mutation position in cdna	12/3938	604/2816
all	cds position	mutation position in cds	10/897	491/2406
SNP	peptide pairs	list of reference-mutant peptide pairs derived from this SNP, along with peptide mutation position. Corresponding peptides will be found in the peptide output tables.	[('REEDIYQFAYCYPYTYTRFQ', 'REENIYQFAYCYPYTYTRFQ', 4, []), ('REEDIYQFAYCYPYTYTRFQ', 'REENIYQFAYCYPYTYTRFQ', 4, [])]	[('LGSGVTVSSGQGIPPPTRGN', 'LGSGVTVSSGQGIPPPYPGE', 17, []), ('LGSGVTVSSGQGIPPPTRGN', 'LGSGVTVSSGQGIPPPYPGE', 17, [])]
SNP	peptide warnings	warnings from peptide generation for each variant. This will include all warnings for successfully generated peptides, as well as warnings for variants where peptides could not be generated	protein sequence start with X, mutation position in peptide not desired because the mutation is near the start codon	Reached end of frameshift mutation position may vary
peptide	peptide pair id	A serial number for peptide pairs in the peptide-output table.
peptide	transcript reference allele	reference allele (nucleotide) decoded from hgvs_dna	C	C
peptide	transcript mutant allele	tumor allele (nucleotide) decoded from hgvs_dna	T	CC
peptide	reference peptide	reference peptide with requested PEPTIDELENGTH	REEDIYQFAYCYPYTYTRFQ	LGSGVTVSSGQGIPPPTRGN
peptide	mutated peptide	mutant peptide with requested PEPTIDELENGTH	REENIYQFAYCYPYTYTRFQ	LGSGVTVSSGQGIPPPYPGE
peptide	peptide mutation position	peptide mutation position	4	17
peptide	strand	transcript strand: 1 for sense and -1 for anti-sense strand	-1	-1
peptide	warnings	any warnings associated with peptide generation for each reference-mutant peptide pair	protein sequence start with X, mutation position in peptide not desired because the mutation is near the start codon	Reached end of frameshift mutation position may vary
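
Once downloaded, the output tables can be filtered on any of the columns defined above. The sketch below assumes the table has been saved as a tab-delimited file with a header row, which may not match the export format you receive; `filter_rows` is a hypothetical helper that looks a column up by its header text rather than by position.

```shell
# Hypothetical helper: keep the header plus every row whose COLUMN equals VALUE.
# Assumes a tab-delimited table with a header row on stdin.
filter_rows() {   # usage: filter_rows COLUMN VALUE < table.tsv
    awk -F '\t' -v name="$1" -v value="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == name) c = i
            if (!c) exit 1            # requested column is not in the header
            print; next
        }
        $c == value'
}

# Demo using three of the documented columns and the two example variants.
printf 'chr\tposition\tmutation effect\nchr1\t49045703\tmissense_variant\nchr19\t6477239\tframeshift_variant\n' |
    filter_rows "mutation effect" frameshift_variant
```

The demo prints the header followed by the chr19 frameshift row. An unknown column name makes the function exit with a non-zero status and no output.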