Mutated Peptide Generator
The Mutated Peptide Generator tool takes a SNPEff-annotated VCF as input and generates predicted neo-peptides together with the reference (WT) peptides with which they pair.
VCF Input Files
This tool accepts VCF files as input. We list a few recommended steps to take with input VCFs before running this tool.
The last line of the VCF header should look similar to:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT {SAMPLE1} {SAMPLE2} ... {SAMPLEN}
The ‘SAMPLE’ columns are not required.
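The header can be checked before upload with a few lines of Python. This is a minimal sketch, not part of the tool: the helper names `open_vcf`, `read_column_header`, and `sample_names` are hypothetical, and it assumes a tab-delimited VCF that is either plain text or gzip/bgzip-compressed.

```python
import gzip
import io

FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def read_column_header(handle):
    """Return the fields of the '#CHROM' line, or None if it is missing."""
    for line in handle:
        if line.startswith("##"):
            continue  # meta-information lines come before the column header
        if line.startswith("#CHROM"):
            return line.rstrip("\n").split("\t")
        break  # reached a data line without seeing a column header
    return None


def sample_names(fields):
    """Validate the eight fixed columns and return any sample names after FORMAT."""
    if fields is None or fields[:8] != FIXED_COLUMNS:
        raise ValueError("VCF column header is missing or malformed")
    if len(fields) > 8 and fields[8] == "FORMAT":
        return fields[9:]
    return []  # sample columns are optional


# Toy in-memory VCF; use open_vcf("infile.vcf.gz") for a real file.
demo = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tnormal_sample\ttumor_sample\n"
    "chr1\t49045703\t.\tC\tT\t.\t.\t.\tGT\t0/0\t0/1\n"
)
print(sample_names(read_column_header(demo)))  # ['normal_sample', 'tumor_sample']
```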
Recommended: Strip Unnecessary Annotations
Stripping unnecessary annotations from the VCF is not required, but it is recommended because it helps ensure a consistent starting point. Annotations can be stripped with vcftools, which drops INFO fields on --recode unless --recode-INFO-all is given. For example:
vcftools \
--gzvcf infile.vcf.gz \
--recode \
--stdout \
| bgzip -c > infile.stripped.vcf.gz
# index the VCF
tabix infile.stripped.vcf.gz
Recommended: Decomposition and Normalization
Normalization and decomposition of the VCF are highly recommended, but not required. This step should be done before annotation with SNPEff, using vt. Here is an example command:
zcat < infile.stripped.vcf.gz \
| vt decompose -s - \
| vt normalize -r $REFERENCE - \
| bgzip -c > infile.stripped.decomposed.normalized.vcf.gz
# index the VCF
tabix infile.stripped.decomposed.normalized.vcf.gz
The above steps can be combined into one, e.g.:
vcftools \
--gzvcf infile.vcf.gz \
--recode \
--stdout \
| vt decompose -s - \
| vt normalize -r $REFERENCE - \
| bgzip -c > infile.stripped.decomposed.normalized.vcf.gz
# index the VCF
tabix infile.stripped.decomposed.normalized.vcf.gz
REFERENCE should point to the genomic FASTA file used for sequence alignment, for instance Homo_sapiens.GRCh38.dna.primary_assembly.fa.
Recommended: SNPEff annotation
Although we provide the option to run SNPEff annotation on unannotated VCFs, we recommend that users run this before uploading VCFs to the tool.
Other annotators are available, but the MPG tool will only work with SNPEff annotations.
To run SNPEff, you will need to prepare a small file that lists the names of the ‘normal’ and ‘tumor’ samples as they appear in the VCF file. In the command below, this is the ‘in.samples’ file. Here is the recommended command for annotation with the GRCh38.86 reference genome:
echo -e "normal_sample\ttumor_sample" > in.samples
zcat < infile.stripped.decomposed.normalized.vcf.gz \
| java -Xmx16G -jar snpEff.jar -cancer -cancerSamples in.samples GRCh38.86 \
> infile.ann.vcf
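After annotation, it can be worth confirming that the records actually carry SNPEff's ANN entry in the INFO column. The following is a rough sketch under that assumption. The function name `count_annotated` is hypothetical, and the toy ANN value is truncated for brevity.

```python
import io


def count_annotated(handle):
    """Return (records_with_ANN, total_records) for an open VCF text stream."""
    annotated = total = 0
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        total += 1
        fields = line.rstrip("\n").split("\t")
        info = fields[7] if len(fields) > 7 else ""
        # INFO entries are separated by ';' and SNPEff writes one named ANN
        if any(entry.startswith("ANN=") for entry in info.split(";")):
            annotated += 1
    return annotated, total


# Toy records: the first carries a (shortened) ANN entry, the second has none.
demo = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t49045703\t.\tC\tT\t.\t.\tANN=T|missense_variant|MODERATE|AGBL4\n"
    "chr19\t6477239\t.\tA\tAG\t.\t.\t.\n"
)
print(count_annotated(demo))  # (1, 2)
```

If the first number is zero, the file was most likely not annotated with SNPEff.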
Parameter Selection
Peptide Length
The number of amino acids in the peptide
Peptide 1 and 2 Mutation Position
Position in the peptide where the mutation should be located. By default, only one peptide per variant is created. To create an additional peptide for a given variant with the mutation at a different position, fill in the ‘Peptide 2 Mutation Position’ field.
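To illustrate how the peptide length and the mutation position interact, here is a minimal sketch of placing a fixed-length window around a mutated residue. It is not the tool's implementation: `peptide_window` is a hypothetical helper, positions are assumed to be 1-based, and shifting the window when the variant sits near a terminus is an assumption about the behaviour. The toy "protein" is simply the example reference peptide from the column table.

```python
def peptide_window(protein, protein_pos, peptide_length, mutation_pos):
    """Cut a peptide so the residue at protein_pos lands at mutation_pos.

    Positions are 1-based. Returns (peptide, actual_mutation_pos); the actual
    position differs from the requested one when the window had to be shifted
    to stay inside the protein.
    """
    start = protein_pos - mutation_pos  # 0-based start of the ideal window
    # Clamp the window to the protein boundaries (assumed behaviour).
    start = max(0, min(start, len(protein) - peptide_length))
    peptide = protein[start:start + peptide_length]
    return peptide, protein_pos - start


reference = "REEDIYQFAYCYPYTYTRFQ"  # toy sequence, used here as a whole protein

print(peptide_window(reference, 4, 9, 4))   # ('REEDIYQFA', 4)
print(peptide_window(reference, 12, 9, 4))  # ('AYCYPYTYT', 4)
print(peptide_window(reference, 4, 9, 5))   # ('REEDIYQFA', 4), shifted near the start
```

In the last call the requested position 5 cannot be honoured because only three residues precede the mutation, so the window starts at the first residue and the mutation ends up at position 4.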
Frameshift Overlap
Frameshift mutations often result in a relatively long stretch of amino acids that are different from the reference. This tool will break that long stretch into overlapping peptides with an overlap length specified by this parameter.
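The tiling idea can be sketched as a sliding window whose step is the peptide length minus the overlap. This is an illustrative sketch only: `tile_with_overlap` is a hypothetical name, and anchoring the final peptide to the end of the sequence (so it may overlap its neighbour by more than the requested amount) is an assumption, not documented behaviour.

```python
def tile_with_overlap(sequence, peptide_length, overlap):
    """Split a sequence into full-length peptides that overlap by `overlap` residues."""
    if not 0 <= overlap < peptide_length:
        raise ValueError("overlap must be >= 0 and smaller than peptide_length")
    if len(sequence) <= peptide_length:
        return [sequence]  # nothing to split
    step = peptide_length - overlap
    last_start = len(sequence) - peptide_length
    # Regular windows, plus one final window anchored at the end of the sequence.
    starts = list(range(0, last_start, step)) + [last_start]
    return [sequence[s:s + peptide_length] for s in starts]


# Toy 16-residue stretch, 8-mers with an overlap of 4.
print(tile_with_overlap("ABCDEFGHIJKLMNOP", 8, 4))
# ['ABCDEFGH', 'EFGHIJKL', 'IJKLMNOP']
```

A larger overlap yields more peptides for the same stretch, and a smaller overlap yields fewer peptides that share less sequence.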
Maximum Peptide Length
There are several instances where creating a peptide longer than the ‘Peptide Length’ parameter may be desirable. For instance, in-frame insertions of a few amino acids might require extending the peptide to keep the C termini of the reference and mutant peptides aligned. Additionally, if a frameshift results in a mutated sequence that is only a few amino acids longer than the ‘Peptide Length’, it might be preferable to create one longer peptide rather than break it up into highly overlapping peptides. Finally, when a variant is near the termini and multiple positions are selected, generating a single longer peptide may be preferable to generating multiple short, highly overlapping peptides.
Reference Genome
Options include GRCh38 (default), GRCh37, and GRCm38/mm10. The selected genome should match the genome used to generate the VCF file.
Run SNPEff Annotation
If checked, SNPEff will be executed against the VCF file before running through the peptide generation tool. We recommend that users take care of this step themselves before uploading the VCF, as it can be time-consuming.
Results
Three tables are included in the output:
SNPs - One row per SNP.
Peptide - One row per SNP and affected transcript.
Unique Peptide - One row per SNP and unique peptide. Peptides that can be produced by multiple transcripts will be collapsed in this output and a representative transcript is selected.
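The relationship between the Peptide and Unique Peptide tables can be pictured as a group-by on variant and peptide. The sketch below is only an illustration: `collapse_unique` is a hypothetical helper, the `transcript` key and the second transcript label are placeholders, and keeping the first transcript encountered as the representative is an assumption, since the selection rule is not described here.

```python
def collapse_unique(rows):
    """Keep one row per (variant id, mutated peptide), first transcript wins."""
    unique = {}
    for row in rows:
        key = (row["variant id"], row["mutated peptide"])
        unique.setdefault(key, row)  # later duplicates are ignored
    return list(unique.values())


# Toy rows: the same peptide reported for two transcripts of one variant.
peptide_rows = [
    {"variant id": 11089, "mutated peptide": "REENIYQFAYCYPYTYTRFQ",
     "transcript": "ENST00000416121"},
    {"variant id": 11089, "mutated peptide": "REENIYQFAYCYPYTYTRFQ",
     "transcript": "PLACEHOLDER_TRANSCRIPT_2"},  # placeholder, not a real accession
]

for row in collapse_unique(peptide_rows):
    print(row["variant id"], row["mutated peptide"], row["transcript"])
```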
Each of the outputs contains many columns, which are described in detail below.
Column Definitions
output table | column | definition | example1 | example2 |
---|---|---|---|---|
all | chr | chromosome | chr1 | chr19 |
all | position | chromosomal position of mutation | 49045703 | 6477239 |
all | reference nucleotide | reference nucleotide | C | A |
all | mutated nucleotide | mutant nucleotide | T | AG |
all | mutation effect | predicted mutation effect (e.g., missense_variant, frameshift_variant, inframe_insertion, inframe_deletion, etc.) | missense_variant | frameshift_variant |
all | gene name | HGNC gene symbol | AGBL4 | DENND1C |
all | Ensembl gene accession | Ensembl gene identifier | ENSG00000186094 | ENSG00000205744 |
all | Ensembl transcript accession | Ensembl transcript identifier | ENST00000416121 | ENST00000381480 |
all | reference aa | reference amino acid | Asp | Thr |
all | mutated aa | mutated amino acid | Asn | fs |
all | protein position | mutation position in protein | 4/298 | 164/801 |
all | variant id | an internal unique identifier assigned to each variant | 11089 | 116919 |
all | mutation impact | SNPEff-predicted variant impact: LOW, MODERATE, or HIGH. | MODERATE | HIGH |
all | transcript biotype | a classification of the transcript type. These will include protein_coding, the different IG_* and TR_* types, as well as nonsense_mediated_decay, non_stop_decay, pseudogene, etc. | protein_coding | protein_coding |
all | transcript mutation code | mutation in HGVS format (nucleotide level) with coordinates based on the transcript | c.10G>A | c.491dupC |
all | protein mutation code | mutation in HGVS format (amino acid level) with coordinates based on the protein | p.Asp4Asn | p.Thr165fs |
all | cdna position | mutation position in the cDNA / cDNA length | 12/3938 | 604/2816 |
all | cds position | mutation position in the CDS / CDS length | 10/897 | 491/2406 |
SNP | peptide pairs | list of reference-mutant peptide pairs derived from this SNP, along with the peptide mutation position. Corresponding peptides will be found in the peptide output tables. | [('REEDIYQFAYCYPYTYTRFQ', 'REENIYQFAYCYPYTYTRFQ', 4, []), ('REEDIYQFAYCYPYTYTRFQ', 'REENIYQFAYCYPYTYTRFQ', 4, [])] | [('LGSGVTVSSGQGIPPPTRGN', 'LGSGVTVSSGQGIPPPYPGE', 17, []), ('LGSGVTVSSGQGIPPPTRGN', 'LGSGVTVSSGQGIPPPYPGE', 17, [])] |
SNP | peptide warnings | warnings from peptide generation for each variant. This will include all warnings for successfully generated peptides, as well as warnings for variants where peptides could not be generated | protein sequence start with X, mutation position in peptide not desired because the mutation is near the start codon | Reached end of frameshift mutation position may vary |
peptide | peptide pair id | A serial number for peptide pairs in the peptide-output table. | | |
peptide | transcript reference allele | reference allele (nucleotide) decoded from hgvs_dna | C | C |
peptide | transcript mutant allele | tumor allele (nucleotide) decoded from hgvs_dna | T | CC |
peptide | reference peptide | reference peptide with requested PEPTIDELENGTH | REEDIYQFAYCYPYTYTRFQ | LGSGVTVSSGQGIPPPTRGN |
peptide | mutated peptide | mutant peptide with requested PEPTIDELENGTH | REENIYQFAYCYPYTYTRFQ | LGSGVTVSSGQGIPPPYPGE |
peptide | peptide mutation position | position of the mutation within the peptide (counted from 1 in the examples) | 4 | 17 |
peptide | strand | transcript strand: 1 for sense and -1 for anti-sense strand | -1 | -1 |
peptide | warnings | any warnings associated with peptide generation for each reference-mutant peptide pair | protein sequence start with X, mutation position in peptide not desired because the mutation is near the start codon | Reached end of frameshift mutation position may vary |
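The ‘peptide pairs’ examples above look like Python list literals, so one way to read them back is `ast.literal_eval`. This is a sketch under that assumption: `parse_peptide_pairs` is a hypothetical helper, and the fourth element of each tuple (an empty list in both examples) is not documented here, so it is simply dropped.

```python
import ast


def parse_peptide_pairs(cell):
    """Parse a 'peptide pairs' cell into unique (reference, mutant, position) tuples."""
    pairs = []
    for reference, mutant, position, _extra in ast.literal_eval(cell):
        record = (reference, mutant, position)
        if record not in pairs:  # the examples contain repeated pairs
            pairs.append(record)
    return pairs


cell = (
    "[('REEDIYQFAYCYPYTYTRFQ', 'REENIYQFAYCYPYTYTRFQ', 4, []), "
    "('REEDIYQFAYCYPYTYTRFQ', 'REENIYQFAYCYPYTYTRFQ', 4, [])]"
)

for reference, mutant, position in parse_peptide_pairs(cell):
    # Positions in the examples count from 1, so subtract 1 for Python indexing.
    print(reference[position - 1], "->", mutant[position - 1], "at", position)
# D -> N at 4
```

`ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for cells read from a file.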