Both the web and command line interfaces to Ensembl VEP can use the same input formats.
Supported input formats:
Note: Ensembl VEP can read compressed (gzipped) input files in any supported format, e.g.:
./vep -i input.vcf.gz -o output.txt
The default format is a simple whitespace-separated format (columns may be separated by space or tab characters), containing five required columns plus an optional identifier column:
1 881907 881906 -/C +
2 946507 946507 G/C +
5 140532 140532 T/C +
8 150029 150029 A/T + var2
12 1017956 1017956 T/A +
14 19584687 19584687 C/T -
19 66520 66520 G/A + var1
An insertion (of any size) is indicated by start coordinate = end coordinate + 1. For example, an insertion of 'C' between nucleotides 12600 and 12601 on the forward strand of chromosome 8 is indicated as follows:
8 12601 12600 -/C +
A deletion is indicated by the exact nucleotide coordinates. For example, a three base pair deletion of nucleotides 12600, 12601, and 12602 of the reverse strand of chromosome 8 will be:
8 12600 12602 CGT/- -
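The coordinate conventions above can be checked mechanically. The following is a minimal sketch, not part of Ensembl VEP: the names `DefaultVariant`, `parse_default_line` and `classify` are hypothetical, and the classification covers only the insertion and deletion rules described here.

```python
from typing import NamedTuple, Optional


class DefaultVariant(NamedTuple):
    chrom: str
    start: int
    end: int
    alleles: str               # e.g. "G/C", "-/C", "CGT/-"
    strand: str                # "+" or "-"
    identifier: Optional[str]  # optional sixth column


def parse_default_line(line: str) -> DefaultVariant:
    """Split one whitespace-separated line into its five or six columns."""
    cols = line.split()
    if len(cols) not in (5, 6):
        raise ValueError(f"expected 5 or 6 columns, got {len(cols)}")
    chrom, start, end, alleles, strand = cols[:5]
    identifier = cols[5] if len(cols) == 6 else None
    return DefaultVariant(chrom, int(start), int(end), alleles, strand, identifier)


def classify(v: DefaultVariant) -> str:
    """Label a variant using the coordinate conventions described above."""
    if "/" not in v.alleles:
        # No REF/ALT pair, e.g. a structural variant type in the allele column.
        return "other"
    ref, _, alt = v.alleles.partition("/")
    if ref == "-" and v.start == v.end + 1:
        return "insertion"
    if alt == "-":
        return "deletion"
    return "substitution"


print(classify(parse_default_line("8 12601 12600 -/C +")))    # insertion
print(classify(parse_default_line("8 12600 12602 CGT/- -")))  # deletion
```

A line with three or more alleles would be labelled from its first two only; this sketch does not attempt to handle that case.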
Structural variants are also supported by indicating a structural variant type instead of the allele:
1 20000 30000 CN4 + cnv4
1 160283 471362 DUP + dup
1 1385015 1387562 DEL + del1
12 1017956 1017956 INV + inv
21 25587759 25587769 CN0 + del2
VCF (Variant Call Format) version 4.0 is supported. This is a common format produced by many variant calling tools and is the recommended format for use with Ensembl VEP:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 65568 . A C . . . .
1 230710048 rs699 A G . . . .
2 265023 . C T . . . .
3 319780 . GA G . . . .
20 3 . C CAAG,CAAGAAG . PASS . .
21 43762120 rs1300 T A,C,G . . . .
Structural variants are also supported depending on structural variant type.
VCF users should note a difference in how Ensembl and VCF describe unbalanced variants. For any unbalanced variant (i.e. an insertion, deletion or unbalanced substitution), the VCF specification requires that the base immediately before the variant be included in both the reference and variant alleles. This also affects the reported position: it will be one base before the actual site of the variant.
To parse this correctly, Ensembl VEP converts such variants into Ensembl-type coordinates by removing the additional base and adjusting the coordinates accordingly. This means that if an identifier is not supplied for a variant (in the 3rd column of the VCF), the identifier constructed and the position reported in the Ensembl VEP output file will differ from the input.
This problem can be overcome by supplying a unique identifier for each variant in the 3rd column of the VCF, or by using VCF format output (--vcf).
The following examples illustrate how VCF describes a variant and how it is handled internally by Ensembl VEP. Consider the following aligned sequences (for the purposes of discussion on chromosome 20):
Ref: a t C g a // C is the reference base
1 : a t G g a // C base is a G in individual 1
2 : a t - g a // C base is deleted w.r.t. the reference in individual 2
3 : a t CAg a // A base is inserted w.r.t. the reference sequence in individual 3
Individual 1
The first individual shows a simple balanced substitution of G for C at base 3. This is described in a compatible manner in VCF and Ensembl styles. Firstly, in VCF:
20 3 . C G . PASS .
And in Ensembl format:
20 3 3 C/G +
Individual 2
The second individual has the 3rd base deleted relative to the reference. In VCF, both the reference and variant allele columns must include the preceding base (T) and the reported position is that of the preceding base:
20 2 . TC T . PASS .
In Ensembl format, the preceding base is not included, and the start/end coordinates represent the region of the sequence deleted. A "-" character is used to indicate that the base is deleted in the variant sequence:
20 3 3 C/- +
The upshot of this is that while in the VCF input file the position of the variant is reported as 2, in the output file in Ensembl VEP default format the position will be reported as 3. If no identifier is provided in the third column of the VCF, then the constructed identifier will be:
20_3_C/-
Individual 3
The third individual has an "A" inserted between the 3rd and 4th bases of the sequence relative to the reference. In VCF, as for the deletion, the base before the insertion is included in both the reference and variant allele columns, and the reported position is that of the preceding base:
20 3 . C CA . PASS .
In Ensembl format, again the preceding base is not included, and the start/end positions are "swapped" to indicate that this is an insertion. Similarly to a deletion, a "-" is used to indicate no sequence in the reference:
20 4 3 -/A +
Again, the output will appear different, and the constructed identifier may not be what is expected:
20_3_-/A
Using VCF format output, or adding unique identifiers to the input (in the third VCF column), can mitigate this issue.
For VCF entries with multiple alternate alleles, Ensembl VEP will only trim the leading base from alleles if all REF and ALT alleles start with the same base:
20 3 . C CAAG,CAAGAAG . PASS .
This will be considered internally by Ensembl VEP as equivalent to:
20 4 3 -/AAG/AAGAAG +
Now consider the case where a single VCF line contains a representation of both a SNV and an insertion:
20 3 . C CAAG,G . PASS .
Here the input alleles will remain unchanged, and Ensembl VEP will consider the first REF/ALT pair as a substitution of CAAG for C, and the second as a C/G SNV:
20 3 3 C/CAAG/G +
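The conversion rules illustrated in this section can be summarised in a few lines of code. This is a sketch of the behaviour as described here, not Ensembl VEP's own implementation: the function name `vcf_to_ensembl` is hypothetical, the forward strand is assumed, and symbolic alleles such as `<DEL>` are not handled.

```python
from typing import List, Tuple


def vcf_to_ensembl(pos: int, ref: str, alts: List[str]) -> Tuple[int, int, str]:
    """Return (start, end, allele_string) in Ensembl-style coordinates."""
    alleles = [ref] + alts
    # An entry is unbalanced if any ALT differs in length from REF.
    unbalanced = any(len(alt) != len(ref) for alt in alts)
    # The leading base is only trimmed if every allele starts with it.
    shared_first_base = len({allele[0] for allele in alleles}) == 1

    start = pos
    if unbalanced and shared_first_base:
        alleles = [allele[1:] for allele in alleles]
        start = pos + 1

    # For an insertion REF is now empty, so end == start - 1.
    end = start + len(alleles[0]) - 1
    allele_string = "/".join(allele or "-" for allele in alleles)
    return start, end, allele_string


print(vcf_to_ensembl(3, "C", ["G"]))                # (3, 3, 'C/G')
print(vcf_to_ensembl(2, "TC", ["T"]))               # (3, 3, 'C/-')
print(vcf_to_ensembl(3, "C", ["CA"]))               # (4, 3, '-/A')
print(vcf_to_ensembl(3, "C", ["CAAG", "CAAGAAG"]))  # (4, 3, '-/AAG/AAGAAG')
print(vcf_to_ensembl(3, "C", ["CAAG", "G"]))        # (3, 3, 'C/CAAG/G')
```

The last call shows the mixed SNV and insertion case: because the ALT alleles do not all share the first base of REF, nothing is trimmed.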
To modify this behaviour with the command line tool, use --minimal. This flag forces Ensembl VEP to consider each REF/ALT pair independently, trimming identical leading and trailing bases from each as appropriate. Because this can lead to confusing output (for example, regarding coordinates), it is not the default behaviour. It is recommended to use the --allele_number flag to track the correspondence between alleles as input and how they appear in the output.
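As a rough illustration of per-pair trimming, the sketch below minimises a single REF/ALT pair. It is an assumption-laden approximation rather than the actual --minimal code: the function name `minimise_pair` is hypothetical, and trimming leading bases before trailing bases is an assumed order.

```python
from typing import Tuple


def minimise_pair(pos: int, ref: str, alt: str) -> Tuple[int, int, str]:
    """Trim bases shared by one REF/ALT pair; return (start, end, alleles)."""
    start = pos
    # Trim identical leading bases, shifting the start coordinate each time.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1
    # Trim identical trailing bases; the start coordinate is unaffected.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    end = start + len(ref) - 1
    return start, end, f"{ref or '-'}/{alt or '-'}"


# Each ALT of a multi-allelic VCF line is handled independently.
for alt in ["CAAG", "G"]:
    print(minimise_pair(3, "C", alt))
# (4, 3, '-/AAG')
# (3, 3, 'C/G')
```

Because the two pairs end up with different coordinates, a per-allele index such as the one reported by --allele_number is useful for matching output back to input.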
HGVS notations are supported; see https://varnomen.hgvs.org for details. These must be relative to genomic or Ensembl transcript coordinates.
It is also possible to use RefSeq transcripts, provided they match the reference genome. See the HGVS documentation.
Examples:
ENST00000618231.3:c.9G>C
ENST00000471631.1:c.28_33delTCGCGG
ENST00000285667.3:c.1047_1048insC
5:g.140532G>C
Examples using RefSeq identifiers (use --refseq on the command line, or select 'RefSeq transcripts' on the web interface):
NM_153681.2:c.7C>T
NM_005239.6:c.190G>A
NM_001025204.2:c.336G>A
HGVS protein notations may also be used, provided that they unambiguously map to a single genomic change. Due to redundancy in the amino acid code, it is not always possible to work out the corresponding genomic sequence change for a given protein sequence change. The following example is a permissible protein notation in dog (Canis familiaris):
ENSCAFP00000040171.1:p.Thr92Asn
Ambiguous gene-based descriptions
It is possible to use ambiguous descriptions listing only gene symbol or UniProt accession and protein change (e.g. PHF21B:p.Tyr124Cys, P01019:p.Ala268Val), as seen in the literature, though this is not recommended as it can produce multiple different variants at genomic level. The transcripts for a gene are considered in the following order:
and the first compatible transcript is used to map variants to the genome for annotation.
Variant identifiers should be dbSNP rsIDs (such as rs699), or any synonym for a variant present in the Ensembl Variation database. Structural variant identifiers (like nsv1000164 and esv1850194) are also supported.
See here for a list of identifier sources in Ensembl.
Examples:
rs1156485833
rs1258750482
rs867704559
esv1815690
nsv1000164
Note: Ensembl VEP is optimised for the analysis of variants in VCF files (sorted by location) or other position-based formats. Using variant identifiers can take 2-5 times longer, as a database lookup is required to find the variant location.
Genomic SPDI notation, which uses four fields delimited by colons, S:P:D:I (Sequence:Position:Deletion:Insertion), is also supported. In SPDI notation, the position refers to the base before the variant, not the base of the variant itself.
See here for details.
Examples:
NC_000016.10:68684738:G:A
NC_000017.11:43092199:GCTTTT:
NC_000013.11:32315789::C
NC_000016.10:68644746:AA:GTA
16:68684738:2:AC
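To make the SPDI position convention concrete, here is a small sketch that splits an SPDI string and derives 1-based start and end coordinates from it. The names `Spdi`, `parse_spdi` and `one_based_range` are hypothetical, and the deletion field is assumed to be either a sequence or a length, as in the examples above.

```python
from typing import NamedTuple, Tuple


class Spdi(NamedTuple):
    sequence: str
    position: int   # base before the variant
    deletion: str   # deleted sequence, or its length as digits; may be empty
    insertion: str  # inserted sequence; may be empty


def parse_spdi(text: str) -> Spdi:
    """Split an S:P:D:I string into its four colon-delimited fields."""
    parts = text.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-delimited fields, got {len(parts)}")
    sequence, position, deletion, insertion = parts
    return Spdi(sequence, int(position), deletion, insertion)


def one_based_range(spdi: Spdi) -> Tuple[int, int]:
    """Return (start, end) with start = end + 1 for a pure insertion."""
    if spdi.deletion.isdigit():
        deleted = int(spdi.deletion)
    else:
        deleted = len(spdi.deletion)
    return spdi.position + 1, spdi.position + deleted


print(one_based_range(parse_spdi("NC_000016.10:68684738:G:A")))  # (68684739, 68684739)
print(one_based_range(parse_spdi("NC_000013.11:32315789::C")))   # (32315790, 32315789)
print(one_based_range(parse_spdi("16:68684738:2:AC")))           # (68684739, 68684740)
```

With an empty deletion field the resulting range has start = end + 1, which matches the insertion convention of the default input format.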
The Ensembl VEP region REST endpoint requires variants to be described as [chr]:[start]-[end]:[strand]/[allele].
This follows the same conventions as the default input format, with the key difference being that this format does not require the reference (REF) allele to be included; this will be looked up using either a provided FASTA file (preferred) or Ensembl core database. Strand is optional and defaults to 1 (forward strand).
# SNP
5:140532-140532:1/C
# SNP (reverse strand)
14:19584687-19584687:-1/T
# insertion
1:881907-881906:1/C
# 5bp deletion
2:946507-946511:1/-
Structural variants are also supported by indicating a structural variant type in the place of the [allele]:
# structural variant: deletion
21:25587759-25587769/DEL
# structural variant: inversion
21:25587759-25587769/INV
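The region notation can be parsed with a single regular expression. The sketch below is illustrative only: `Region`, `REGION_RE` and `parse_region` are hypothetical names, and the pattern accepts just the shape [chr]:[start]-[end]:[strand]/[allele] with an optional strand of 1 or -1.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: int  # 1 (forward, the default) or -1 (reverse)
    allele: str  # variant allele, "-" for a deletion, or a structural variant type


REGION_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)"
    r"(?::(?P<strand>-?1))?/(?P<allele>.+)$"
)


def parse_region(text: str) -> Region:
    """Parse one region string; strand defaults to 1 when omitted."""
    match = REGION_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid region string: {text!r}")
    strand = int(match.group("strand") or 1)
    return Region(
        match.group("chrom"),
        int(match.group("start")),
        int(match.group("end")),
        strand,
        match.group("allele"),
    )


print(parse_region("14:19584687-19584687:-1/T"))
print(parse_region("21:25587759-25587769/DEL"))
```

The second call omits the strand, so the parsed record falls back to the forward strand.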
Ensembl VEP also predicts molecular consequences for structural variants using the following input formats:
To recognise a variant as a structural variant, the allele string (or SVTYPE in the INFO column of the VCF format) must be set to one of the currently supported values:
INFO fields describing the tandem repeat, such as RUS and RN (check the VCF 4.4 specification, section 5.7)
CIRUC and CIRB INFO fields are ignored when calculating alternative alleles in tandem repeats
ALT column and need to meet the HTS specifications, such as TG[12:58877476[
ALT, such as T. and .G
ALT, such as A[22:22893780[,A[X:10932343[
More information on how Ensembl VEP processes structural variants can be found here.
Examples of structural variants encoded in VCF format
#CHROM POS ID REF ALT QUAL FILTER INFO
1 160283 dup . <DUP> . . SVTYPE=DUP;END=471362
1 1385015 del . <DEL> . . SVTYPE=DEL;END=1387562
1 7936271 bnd N N[12:58877476[ . . SVTYPE=BND
See the VCF definition document for more detail on how to describe structural variants in VCF format.
Ensembl VEP can return the results in different formats:
Along with the results Ensembl VEP computes and returns some statistics.
The default output format ("VEP" format when downloading from the web interface) is a 14 column tab-delimited file. Empty values are denoted by '-'. The output columns are:
Example of Ensembl VEP default output format:
11_224088_C/A 11:224088 A ENSG00000142082 ENST00000525319 Transcript missense_variant 742 716 239 T/N aCc/aAc - SIFT=deleterious(0);PolyPhen=unknown(0)
11_224088_C/A 11:224088 A ENSG00000142082 ENST00000534381 Transcript 5_prime_UTR_variant - - - - - - -
11_224088_C/A 11:224088 A ENSG00000142082 ENST00000529055 Transcript downstream_variant - - - - - - -
11_224585_G/A 11:224585 A ENSG00000142082 ENST00000529937 Transcript intron_variant - - - - - - HGVSc=ENST00000529937.1:c.136-346G>A
22_16084370_G/A 22:16084370 A - ENSR00000615113 RegulatoryFeature regulatory_region_variant - - - - - - -
The Ensembl VEP command line tool will also add a header to the output file. This contains information about the databases connected to, and also a key describing the key/value pairs used in the extra column.
## ENSEMBL VARIANT EFFECT PREDICTOR v114.0
## Output produced at 2017-03-21 14:51:27
## Connected to homo_sapiens_core_114_38 on ensembldb.ensembl.org
## Using cache in /homes/user/.vep/homo_sapiens/114_GRCh38
## Using API version 114, DB version 114
## polyphen version 2.2.2
## sift version sift5.2.2
## COSMIC version 78
## ESP version 20141103
## gencode version GENCODE 25
## genebuild version 2014-07
## HGMD-PUBLIC version 20162
## regbuild version 16
## assembly version GRCh38.p7
## ClinVar version 201610
## dbSNP version 147
## Column descriptions:
## Uploaded_variation : Identifier of uploaded variant
## Location : Location of variant in standard coordinate format (chr:start or chr:start-end)
## Allele : The variant allele used to calculate the consequence
## Gene : Stable ID of affected gene
## Feature : Stable ID of feature
## Feature_type : Type of feature - Transcript, RegulatoryFeature or MotifFeature
## Consequence : Consequence type
## cDNA_position : Relative position of base pair in cDNA sequence
## CDS_position : Relative position of base pair in coding sequence
## Protein_position : Relative position of amino acid in protein
## Amino_acids : Reference and variant amino acids
## Codons : Reference and variant codon sequence
## Existing_variation : Identifier(s) of co-located known variants
## Extra column keys:
## IMPACT : Subjective impact classification of consequence type
## DISTANCE : Shortest distance from variant to transcript
## STRAND : Strand of the feature (1/-1)
## FLAGS : Transcript quality flags
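Reading the default output back into a program mostly comes down to skipping the header and unpacking the key/value pairs in the final column. The following is a minimal sketch under stated assumptions: the function name `parse_default_output` is hypothetical, the last column is labelled "Extra" here, and the input is assumed to be tab-delimited with header lines starting with "#".

```python
from typing import Dict, Iterable, Iterator

COLUMNS = [
    "Uploaded_variation", "Location", "Allele", "Gene", "Feature",
    "Feature_type", "Consequence", "cDNA_position", "CDS_position",
    "Protein_position", "Amino_acids", "Codons", "Existing_variation",
    "Extra",
]


def parse_default_output(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per data line, with the Extra column split into a dict."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "##" / "#" header lines
        record: Dict[str, object] = dict(zip(COLUMNS, line.split("\t")))
        extra = str(record.get("Extra", "-"))
        if extra == "-":
            record["Extra"] = {}
        else:
            # Pairs look like KEY=VALUE and are separated by semicolons.
            record["Extra"] = dict(pair.split("=", 1) for pair in extra.split(";"))
        yield record


sample = [
    "## ENSEMBL VARIANT EFFECT PREDICTOR v114.0",
    "\t".join([
        "11_224088_C/A", "11:224088", "A", "ENSG00000142082",
        "ENST00000525319", "Transcript", "missense_variant", "742", "716",
        "239", "T/N", "aCc/aAc", "-", "SIFT=deleterious(0);PolyPhen=unknown(0)",
    ]),
]
for rec in parse_default_output(sample):
    print(rec["Consequence"], rec["Extra"])
```

In practice the same function could be fed an open file handle instead of the `sample` list.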
The --tab flag specifies tab-delimited output.
This differs from the default output format in that each individual field from the "Extra" field is written to a separate tab-delimited column.
This makes the output more suitable for import into spreadsheet programs such as Excel.
The header is the same as the one for the Ensembl VEP default output format. This is also the format used when selecting the "TXT" option on the Ensembl VEP web interface.
Example of tab-delimited output format:
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation IMPACT DISTANCE STRAND FLAGS
11_224088_C/A 11:224088 A ENSG00000142082 ENST00000525319 Transcript missense_variant 742 716 239 S/I aGc/aTc - MODERATE - -1 -
11_224088_C/A 11:224088 A ENSG00000142082 ENST00000534381 Transcript downstream_gene_variant - - - - - - MODIFIER 1674 -1 -
11_224088_C/A 11:224088 A ENSG00000142082 ENST00000529055 Transcript downstream_gene_variant - - - - - - MODIFIER 134 -1 -
11_224585_G/A 11:224585 A ENSG00000142082 ENST00000529937 Transcript intron_variant,NMD_transcript_variant - - - - - - MODIFIER - -1 -
Note: The Existing_variation column is only populated when --check_existing is used, or any other option that switches it on (for example, --af_1kg silently switches on --check_existing).
The choice and order of columns in the output may be configured using --fields. For instance:
./vep -i examples/homo_sapiens_GRCh38.vcf --cache --force_overwrite --tab --fields "Uploaded_variation,Location,Allele,Gene"
The Ensembl VEP command line tool can also generate VCF output using the --vcf flag.
Main information about the VCF output format:
Here is a list of the (default) fields you can find within the CSQ field:
Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID
Note: Some fields, including HGVS, SYMBOL and Existing_variation, are not reported by default and must be enabled explicitly.
Example command using the --vcf and --fields options:
./vep -i examples/homo_sapiens_GRCh38.vcf --cache --force_overwrite --vcf --fields "Allele,Consequence,Feature_type,Feature"
VCFs produced by Ensembl VEP can be filtered using filter_vep.pl in the same way as standard format output files.
If the input format was VCF, the file will remain unchanged save for the addition of the CSQ field and the header (unless using any filtering). If an existing CSQ field is found, it will be replaced by the new annotation (use --keep_csq to preserve it).
Custom data added with --custom are added as separate fields, using the key specified for each data file.
Commas in fields are replaced with ampersands (&) to preserve VCF format.
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position">
#CHROM POS ID REF ALT QUAL FILTER INFO
21 26978790 rs75377686 T C . . CSQ=C|missense_variant|MODERATE|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|2/8||ENST00000419219.1:c.251A>G|ENSP00000404426.1:p.Asn84Ser|260|251|84
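Because the CSQ field names are declared in the header, a parser can read them from there rather than hard-coding them. The sketch below is one way to do that, not an official parser: the function names `csq_fields` and `parse_csq` are hypothetical, and it assumes the header Description ends with "Format: " followed by the pipe-separated field names, with multiple annotations in one CSQ value separated by commas.

```python
from typing import Dict, List

HEADER = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    "from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|"
    "Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|"
    'CDS_position|Protein_position">'
)
INFO = (
    "CSQ=C|missense_variant|MODERATE|MRPL39|ENSG00000154719|Transcript|"
    "ENST00000419219|protein_coding|2/8||ENST00000419219.1:c.251A>G|"
    "ENSP00000404426.1:p.Asn84Ser|260|251|84"
)


def csq_fields(header_line: str) -> List[str]:
    """Extract the pipe-separated field names from the CSQ header line."""
    description = header_line.split("Format: ", 1)[1]
    return description.rstrip('">').split("|")


def parse_csq(info: str, fields: List[str]) -> List[Dict[str, str]]:
    """Return one dict per annotation block found in the CSQ INFO entry."""
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            blocks = entry[len("CSQ="):].split(",")
            return [dict(zip(fields, block.split("|"))) for block in blocks]
    return []


annotations = parse_csq(INFO, csq_fields(HEADER))
print(annotations[0]["SYMBOL"], annotations[0]["Consequence"], annotations[0]["HGVSp"])
```

Values that originally contained commas appear with ampersands (&) instead, so a caller may want to split on "&" for fields such as Consequence.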
Ensembl VEP can output serialised JSON objects using the --json flag. JSON is a serialisation format that can be parsed and processed easily by many packages and programming languages; it is used as the default output format for Ensembl's REST server.
Each input variant is reported as a single JSON object which constitutes one line of the output file. The JSON object is structured somewhat differently to the other output formats, in that per-variant fields (e.g. co-located existing variant details) are reported only once. Consequences are grouped under the feature type that they affect (Transcript, Regulatory Feature, etc). The original input line (e.g. from VCF input) is reported under the "input" key in order to aid aligning input with output. When using a cache file, frequencies for co-located variants are reported by default (see --af_1kg, --af_gnomade).
Here follows an example of JSON output (prettified and redacted for display here):
{
  "input": "1 1918090 test1 A G . . .",
  "id": "test1",
  "seq_region_name": "1",
  "start": 1918090,
  "end": 1918090,
  "strand": 1,
  "allele_string": "A/G",
  "most_severe_consequence": "missense_variant",
  "colocated_variants": [
    {
      "id": "COSV57068665",
      "seq_region_name": "1",
      "start": 1918090,
      "end": 1918090,
      "strand": 1,
      "allele_string": "COSMIC_MUTATION"
    },
    {
      "id": "rs28640257",
      "seq_region_name": "1",
      "start": 1918090,
      "end": 1918090,
      "strand": 1,
      "allele_string": "A/G/T",
      "minor_allele": "G",
      "minor_allele_freq": 0.352,
      "frequencies": {
        "G": {
          "amr": 0.5072,
          "gnomade_sas": 0.3635,
          "gnomade": 0.481,
          "gnomade_remaining": 0.4536,
          "gnomade_asj": 0.3939,
          "gnomade_nfe": 0.5042,
          "gnomade_afr": 0.0975,
          "afr": 0.053,
          "gnomade_amr": 0.5568,
          "gnomade_fin": 0.4751,
          "sas": 0.3906,
          "gnomade_eas": 0.4516,
          "eur": 0.4901,
          "eas": 0.4623,
          "gnomade_mid": 0.3306
        }
      }
    }
  ],
  "transcript_consequences": [
    {
      "variant_allele": "G",
      "consequence_terms": [
        "missense_variant"
      ],
      "gene_id": "ENSG00000178821",
      "transcript_id": "ENST00000310991",
      "strand": -1,
      "cdna_start": 436,
      "cdna_end": 436,
      "cds_start": 422,
      "cds_end": 422,
      "protein_start": 141,
      "protein_end": 141,
      "codons": "aTg/aCg",
      "amino_acids": "M/T",
      "polyphen_prediction": "benign",
      "polyphen_score": 0.001,
      "sift_prediction": "tolerated",
      "sift_score": 0.22,
      "hgvsp": "ENSP00000311122.3:p.Met141Thr",
      "hgvsc": "ENST00000310991.8:c.422T>C"
    }
  ],
  "regulatory_feature_consequences": [
    {
      "variant_allele": "G",
      "consequence_terms": [
        "regulatory_region_variant"
      ],
      "regulatory_feature_id": "ENSR00000000255"
    }
  ]
}
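Since each variant occupies one line of the JSON output, the file can be processed line by line with a standard JSON library. The sketch below shows the idea, assuming the keys seen in the example above; the function name `summarise` and the choice of summary fields are illustrative only.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def summarise(json_lines: Iterable[str]) -> Iterator[Tuple[str, str, List[str]]]:
    """Yield (id, most severe consequence, affected transcript IDs) per variant."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        transcripts = [
            consequence["transcript_id"]
            # The key is absent when no transcript is affected.
            for consequence in record.get("transcript_consequences", [])
        ]
        yield record["id"], record["most_severe_consequence"], transcripts


# One (heavily shortened) output line, built here so the example is self-contained.
example_line = json.dumps({
    "id": "test1",
    "most_severe_consequence": "missense_variant",
    "transcript_consequences": [
        {"transcript_id": "ENST00000310991", "consequence_terms": ["missense_variant"]}
    ],
})
for variant_id, severity, transcripts in summarise([example_line]):
    print(variant_id, severity, transcripts)
```

Passing an open file handle to `summarise` processes a whole output file without loading it into memory at once.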
In accordance with JSON conventions, all keys (except alleles) are lower-case. Some keys also have different names and structures to those found in the other Ensembl VEP output formats:
Key | JSON equivalent(s) | Notes
Consequence | consequence_terms |
Gene | gene_id |
Feature | transcript_id, regulatory_feature_id, motif_feature_id | Consequences are grouped under the feature type they affect
ALLELE | variant_allele |
SYMBOL | gene_symbol |
SYMBOL_SOURCE | gene_symbol_source |
ENSP | protein_id |
OverlapBP | bp_overlap |
OverlapPC | percentage_overlap |
Uploaded_variation | id |
Location | seq_region_name, start, end, strand | The variant's location field is broken down into constituent coordinate parts for clarity. "seq_region_name" is used in place of "chr" or "chromosome" for consistency with other parts of Ensembl's REST API
*_maf | *_allele, *_maf |
cDNA_position | cdna_start, cdna_end |
CDS_position | cds_start, cds_end |
Protein_position | protein_start, protein_end |
SIFT | sift_prediction, sift_score |
PolyPhen | polyphen_prediction, polyphen_score |
Ensembl VEP writes an HTML file containing statistics pertaining to the results of your job; it is named [output_file]_summary.html (with the default options the file will be named variant_effect_output.txt_summary.html). To view it, please open the file in your web browser.
The page contains several sections:
General statistics
This section contains two tables. The first describes the cache and/or database used, the version of Ensembl VEP, species, command line parameters, input/output files and run time. The second table contains information about the number of variants, and the number of genes, transcripts and regulatory features overlapped by the input.
Charts and tables
There then follow several charts, most with accompanying tables. Tables and charts are interactive; clicking on a row to highlight it in the table will highlight the relevant segment in the chart, and vice versa.