Mutant Peptide Generator
The Mutant Peptide Generator tool will take a two-sample, SNPEff-annotated VCF as input and generate predicted neo-peptides and the reference / WT peptides with which they pair.
VCF input files
This tool accepts VCF files as input. The files must meet the requirements below, and we also recommend a few preprocessing steps to apply before running this tool. Requirements and recommendations are listed in the order in which they should be met / applied.
Requirement: Two-sample VCF
The VCF file must be a two-sample VCF with the normal/healthy sample in the first sample column and the tumor sample in the second. The header line should look similar to:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT {NORMAL_SAMPLE} {TUMOR_SAMPLE}
Here, NORMAL_SAMPLE and TUMOR_SAMPLE should be replaced with the corresponding sample names.
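The header requirement can be checked before running the tool. The following is a minimal sketch, not part of the tool: the function name `sample_names` and the example sample names are hypothetical, and it only inspects the `#CHROM` line of an uncompressed VCF passed in as text lines.

```python
def sample_names(vcf_lines):
    """Return (normal, tumor) sample names from the #CHROM header line."""
    fixed = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            if fields[:9] != fixed:
                raise ValueError("unexpected fixed columns in header")
            samples = fields[9:]
            if len(samples) != 2:
                raise ValueError(f"expected 2 sample columns, found {len(samples)}")
            # By the convention above: normal first, tumor second.
            return samples[0], samples[1]
        if not line.startswith("#"):
            break  # reached data lines without seeing a #CHROM header
    raise ValueError("no #CHROM header line found")


# Hypothetical sample names, for illustration only.
header = ("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
          "\tpatient1_normal\tpatient1_tumor\n")
normal, tumor = sample_names(["##fileformat=VCFv4.2\n", header])
print(normal, tumor)
```

Note that the check cannot tell whether the two columns are in the right order. It only confirms that exactly two sample columns are present.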
Recommended: Strip unnecessary annotations
Stripping unnecessary annotations from the VCF is not required, but is recommended because it helps to ensure a consistent starting point. Annotations can be stripped with vcftools, which drops INFO fields when recoding unless --recode-INFO or --recode-INFO-all is given. For example:
vcftools \
--gzvcf infile.vcf.gz \
--recode \
--stdout \
| bgzip -c > infile.stripped.vcf.gz
# index the VCF
tabix infile.stripped.vcf.gz
Recommended: Decomposition and normalization
Normalization and decomposition of the VCF are highly recommended, but not required. This step should be done before annotation with SNPEff. Here is an example command:
zcat < infile.stripped.vcf.gz \
| vt decompose -s - \
| vt normalize -r $REFERENCE - \
| bgzip -c > infile.stripped.decomposed.normalized.vcf.gz
# index the VCF
tabix infile.stripped.decomposed.normalized.vcf.gz
The above steps can be combined into one, e.g.:
vcftools \
--gzvcf infile.vcf.gz \
--recode \
--stdout \
| vt decompose -s - \
| vt normalize -r $REFERENCE - \
| bgzip -c > infile.stripped.decomposed.normalized.vcf.gz
# index the VCF
tabix infile.stripped.decomposed.normalized.vcf.gz
REFERENCE should point to the genomic FASTA file used for sequence alignment, for instance Homo_sapiens.GRCh38.dna.primary_assembly.fa.
Requirement: SNPEff annotation
Currently, only VCFs that have been annotated with SNPEff are supported. In the future, SNPEff annotation may be integrated into this tool.
To run SNPEff, you will need to prepare a small tab-separated file containing the names of the ‘normal’ and ‘tumor’ samples as they appear in the VCF file. In the command below, this is the ‘in.samples’ file. Here is the recommended command for annotation with the GRCh38.86 reference genome:
echo -e "normal_sample\ttumor_sample" > in.samples
zcat < infile.stripped.decomposed.normalized.vcf.gz \
| java -Xmx16G -jar snpEff.jar -cancer -cancerSamples in.samples GRCh38.86 \
> infile.ann.vcf
Parameter selection
Peptide Length
The number of amino acids in each generated peptide.
Peptide 1 and 2 Mutation Position
Position in the peptide where the mutation should be located. By default, only one peptide per variant will be created. If you wish to create an additional peptide for a given variant with the SNP at a different position, fill in the ‘Peptide 2 Mutation Position’ field.
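To make the interaction between ‘Peptide Length’ and the mutation position concrete, here is a minimal sketch of cutting a peptide window out of a protein sequence. It is an illustration under assumptions, not the tool's implementation: the function `peptide_window`, the toy protein sequence, and the clamping behavior near the protein termini are all hypothetical. Positions are treated as 1-based.

```python
def peptide_window(protein, mut_pos, length, pos_in_peptide):
    """Cut `length` residues so that protein residue `mut_pos` (1-based)
    lands at `pos_in_peptide` (1-based) whenever the protein allows it.

    Returns (peptide, actual 1-based position of the mutation in the peptide).
    """
    start = mut_pos - pos_in_peptide            # 0-based start of the window
    start = min(start, len(protein) - length)   # do not run past the C-terminus
    start = max(start, 0)                       # do not run past the N-terminus
    peptide = protein[start:start + length]
    return peptide, mut_pos - start


# Hypothetical 33-residue toy protein, for illustration only.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

# Mutation in the middle: the requested position (5) can be honored.
print(peptide_window(protein, 15, 9, 5))

# Mutation at residue 2: the window is clamped to the N-terminus, so the
# mutation ends up at position 2 instead of the requested position 5.
print(peptide_window(protein, 2, 9, 5))
```

The second call mirrors the situation behind warnings such as "mutation position in peptide not desired because the mutation is near the start codon": close to a terminus, the requested position may not be achievable.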
Frameshift Overlap
Frameshift mutations often result in a relatively long stretch of amino acids that are different from the reference. This tool will break that long stretch into overlapping peptides with an overlap length specified by this parameter.
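The tiling described above can be sketched as follows. This is only an illustration of overlapping windows: the function `tile_with_overlap` and the toy sequences are hypothetical, and how the tool itself treats a final peptide that is shorter than the requested length is not specified here (this sketch simply leaves it short).

```python
def tile_with_overlap(sequence, length, overlap):
    """Split `sequence` into peptides of `length` residues, each sharing
    `overlap` residues with the previous peptide."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be >= 0 and smaller than length")
    if len(sequence) <= length:
        return [sequence]
    step = length - overlap
    peptides = []
    # Stop once a window would contain nothing beyond the previous overlap.
    for start in range(0, len(sequence) - overlap, step):
        peptides.append(sequence[start:start + length])
    return peptides


# Toy sequences, for illustration only.
print(tile_with_overlap("ACDEFGHIKL", 4, 2))
print(tile_with_overlap("ACDEFGHIKLM", 4, 2))
```

A larger overlap yields more peptides for the same stretch, because each new peptide advances by only `length - overlap` residues.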
Maximum Peptide Length
There are several instances where creating a peptide longer than the ‘Peptide Length’ parameter may be desirable. For instance, in-frame insertions of a few amino acids might require extending the peptide to keep the C termini of the reference and mutant peptides aligned. Additionally, if a frameshift results in a mutated sequence that is only several amino acids longer than the ‘peptide length’, it might be desirable to create a longer peptide rather than break it up into highly overlapping peptides. Finally, when a variant is near the termini and multiple positions are selected, simply generating a single longer peptide rather than multiple, short, highly-overlapping peptides may be desirable.
Results
Three tables are included in the output:
SNPs - One row per SNP.
Peptide - One row per SNP and affected transcript.
Unique Peptide - One row per SNP and unique peptide. Peptides that can be produced by multiple transcripts will be collapsed in this output and a representative transcript is selected.
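The relationship between the Peptide and Unique Peptide tables can be illustrated with a short sketch. It assumes rows are dictionaries and that the first transcript encountered is kept as the representative. Both are assumptions made for illustration: the tool's actual selection rule is not described here, and the key name `transcript` and the transcript labels `TRANSCRIPT_B` and `TRANSCRIPT_C` are hypothetical.

```python
def collapse_unique(peptide_rows):
    """Keep one row per (variant id, mutated peptide).

    Assumption: the first transcript seen becomes the representative.
    """
    unique = {}
    for row in peptide_rows:
        key = (row["variant id"], row["mutated peptide"])
        unique.setdefault(key, row)  # keeps the first row for each key
    return list(unique.values())


rows = [
    {"variant id": 11089, "transcript": "ENST00000416121",
     "mutated peptide": "REENIYQFAYCYPYTYTRFQ"},
    # Hypothetical second transcript producing the same peptide.
    {"variant id": 11089, "transcript": "TRANSCRIPT_B",
     "mutated peptide": "REENIYQFAYCYPYTYTRFQ"},
    # Hypothetical third transcript producing a different peptide.
    {"variant id": 11089, "transcript": "TRANSCRIPT_C",
     "mutated peptide": "EENIYQFAYCYPYTYTRFQA"},
]
for row in collapse_unique(rows):
    print(row["transcript"], row["mutated peptide"])
```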
Each of the outputs contains many columns, which are described in detail below.
Column definitions
output table | column | definition | example1 | example2 |
---|---|---|---|---|
all | chr | chromosome | chr1 | chr19 |
all | position | chromosomal position of mutation | 49045703 | 6477239 |
all | reference nucleotide | reference nucleotide | C | A |
all | mutated nucleotide | mutant nucleotide | T | AG |
all | mutation effect | predicted mutation effect (e.g., missense_variant, frameshift_variant, inframe_insertion, inframe_deletion, etc.) | missense_variant | frameshift_variant |
all | gene name | HGNC gene symbol | AGBL4 | DENND1C |
all | Ensembl gene accession | Ensembl gene identifier | ENSG00000186094 | ENSG00000205744 |
all | Ensembl transcript accession | Ensembl transcript identifier | ENST00000416121 | ENST00000381480 |
all | reference aa | reference amino acid | Asp | Thr |
all | mutated aa | mutated amino acid | Asn | fs |
all | protein position | mutation position in protein | 4/298 | 164/801 |
all | variant id | an internal unique identifier assigned to each variant | 11089 | 116919 |
all | mutation impact | SNPEff-predicted variant impact: LOW/MODERATE/HIGH. | MODERATE | HIGH |
all | transcript biotype | a classification of the transcript type. These will include protein_coding, the different IG_* and TR_* types, as well as nonsense_mediated_decay, non_stop_decay, pseudogene, etc. | protein_coding | protein_coding |
all | transcript mutation code | mutation in HGVS format (nucleotide level) with coordinates based on the transcript | c.10G>A | c.491dupC |
all | protein mutation code | mutation in HGVS format (amino acid level) with coordinates based on the protein | p.Asp4Asn | p.Thr165fs |
all | cdna position | mutation position in the cDNA / cDNA length | 12/3938 | 604/2816 |
all | cds position | mutation position in the CDS / CDS length | 10/897 | 491/2406 |
SNP | peptide pairs | list of reference-mutant peptide pairs derived from this SNP, along with peptide mutation position. Corresponding peptides will be found in the peptide output tables. | [('REEDIYQFAYCYPYTYTRFQ', 'REENIYQFAYCYPYTYTRFQ', 4, []), ('REEDIYQFAYCYPYTYTRFQ', 'REENIYQFAYCYPYTYTRFQ', 4, [])] | [('LGSGVTVSSGQGIPPPTRGN', 'LGSGVTVSSGQGIPPPYPGE', 17, []), ('LGSGVTVSSGQGIPPPTRGN', 'LGSGVTVSSGQGIPPPYPGE', 17, [])] |
SNP | peptide warnings | warnings from peptide generation for each variant. This will include all warnings for successfully generated peptides, as well as warnings for variants where peptides could not be generated | protein sequence start with X, mutation position in peptide not desired because the mutation is near the start codon | Reached end of frameshift mutation position may vary |
peptide | peptide pair id | A serial number for peptide pairs in the peptide-output table. | ||
peptide | transcript reference allele | reference allele (nucleotide) decoded from hgvs_dna | C | C |
peptide | transcript mutant allele | tumor allele (nucleotide) decoded from hgvs_dna | T | CC |
peptide | reference peptide | reference peptide with requested PEPTIDELENGTH | REEDIYQFAYCYPYTYTRFQ | LGSGVTVSSGQGIPPPTRGN |
peptide | mutated peptide | mutant peptide with requested PEPTIDELENGTH | REENIYQFAYCYPYTYTRFQ | LGSGVTVSSGQGIPPPYPGE |
peptide | peptide mutation position | peptide mutation position | 4 | 17 |
peptide | strand | transcript strand: 1 for sense and -1 for anti-sense strand | -1 | -1 |
peptide | warnings | any warnings associated with peptide generation for each reference-mutant peptide pair | protein sequence start with X, mutation position in peptide not desired because the mutation is near the start codon | Reached end of frameshift mutation position may vary |
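
The ‘peptide pairs’ cells shown in the examples above have the shape of a Python list of tuples, so they can be read back with the standard library. The sketch below rests on a few assumptions: that the cell is always a valid Python literal, that the position is 1-based (which is consistent with both examples), and that the fourth tuple element is a list of warnings (it is empty in both examples, so its meaning is a guess). The function name `parse_peptide_pairs` is hypothetical.

```python
import ast

# Cell content copied from the example1 column of the 'peptide pairs' row.
cell = ("[('REEDIYQFAYCYPYTYTRFQ', 'REENIYQFAYCYPYTYTRFQ', 4, []), "
        "('REEDIYQFAYCYPYTYTRFQ', 'REENIYQFAYCYPYTYTRFQ', 4, [])]")


def parse_peptide_pairs(cell):
    """Turn a 'peptide pairs' cell into a list of dictionaries."""
    pairs = ast.literal_eval(cell)  # safe parsing of Python literals only
    return [
        # Assumption: the 4th element holds per-pair warnings.
        {"reference": ref, "mutant": mut, "position": pos, "warnings": warnings}
        for ref, mut, pos, warnings in pairs
    ]


first = parse_peptide_pairs(cell)[0]
idx = first["position"] - 1  # convert the 1-based position to a string index
print(first["reference"][idx], "->", first["mutant"][idx])  # prints "D -> N"
```

The printed residues agree with the ‘protein mutation code’ of the same example (p.Asp4Asn), which is a quick sanity check when post-processing the tables.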