Tutorial
This tutorial will show an example showcasing the usage of Cancermuts on a well-known autophagy marker protein, MAP1LC3B (or more simply, LC3B).
We will download associated cancer mutations from both cBioPortal and COSMIC and annotate proteins and mutations with all the data sources available in Cancermuts.
Preliminary operations
This tutorial expects that the Cancermuts Python package has been installed and is available to be imported. If you have followed the recommended installation procedure and installed Cancermuts in a Python environment you need to activate it first, as detailed in point 3 of the installation.
You also need to have downloaded the COSMIC mutation export file, as detailed in the installation section.
Finally, you can decide to perform this tutorial either interactively (i.e. running line by line in a Python command line) or as a script (i.e. saving the lines you need in a script and running it). The two options are equvalent from the point of view of the result. If you're going for the first option, we recommended that you install the IPython command line interface which is more user friendly than the regular Python interpreter:
and then just run ipython
to enter the interpreter:
lastly, you can check if your Cancermuts installation was successful by trying and import the cancermuts package:
you shouldn't receive any error. If you receive an error instead we recommend checking that your installation was performed correctly and/or open an Issue on our GitHub repository to receive assistance.
Introduction
Cancermuts works by interrogating different data sources to gather the requested information. These are all encoded in the datasources
module which includes one class for every data source available in cancermuts. The steps for obtaining a cancermuts annotation are roughly as follows:
Create a Sequence object by interrogating an appropriate data source. The Sequence object is the main container of information in cancermuts and links together all the available annotations
Download mutations and add them to the sequence object. This is done by appropriately using other data sources
Download annotations for either mutations (these are called metadata) or position or sequence properties (these are called properties). This is once again done by using appropriate data sources for the task.
Create a pandas data frame from the Sequence object. This will contain all the gathered information.
Create a visualization from the data frame
Tutorial script
The steps performed in the following tutorial are also written in a pre-built Python script available in the docs
folder, called tutorial.py
. In order to run it, you should have installed the Cancermuts package and activated the virtual environment in which it is installed (see Installation). This tutorial script follows the tutorial steps and has been written using a single cancer type as reference (colorectal cancer).
Depending on the location of the files required by Cancermuts on your system, you will need to edit the hardcoded paths in the script to match what is available to you
Once that is done, from the cancermuts/docs
directory, you can just run:
The result should be a metatable.csv
output file together with a my_table.pdf
figure.
We also provide a similar example which considers all available cancer types, in the tutorial_pancancer.py
file. It's run in the same way as with the other tutorial script:
Similarly as before, the result should be a metatable_pancancer.csv
output file together with a my_table_pancancer.pdf
figure.
Tutorial steps
The Sequence object
Cancermuts works The first operation to be performed when starting to work with Cancermuts is creating a Sequence object. This object represents the protein sequence we want to consider for the analysis and links together all the information we are going to collect. In order to create it we use a data source class for UniProt from which we can downlaod protein sequences. This is only source for sequences available at the moment, more can be added in the future.
To download the main UniProt sequence, you will need the corresponding HGCN symbol only. The UniProt object automatically identifies the most likely UniProt ID corresponding to the gene name. Alternatively, it is possible to provide the UniProt ID manually as an option.
This is done as follows:
Collecting cancer mutations
Now that we have the Sequence ready, we can start adding annotations to it. We will first download cancer mutations from cBioPortal and COSMIC.
We first import the required datasource classes for them:
cBioPortal
For cBioPortal, we first create the respective source object:
For this tutorial, we specify a list of cancer studies Cancermuts should use, which correspond to a few colorectal cancer studies. This is especially useful if we are interested in e.g. a certain cancer type or some other specific studies. Studies can be referred to using their cBioPortal study identifier.
Finding the correct set of identifier for the studies of interest can be tricky, as it isn't obvious from the cBioPortal website which study IDs are connected to specific studies. When creating a cBioPortal object without the cancer_studies
argument, two attributes of the object can be helpful to this regard. cb.cancer_types
is a list of cancer types as classified by cBioPortal and cb.cancer_studies
is the full list of studies available on cBioPortal. These list are regular pandas dataframes and can be manipulated as such, including saved to csv files or similar.
Alternatively, one can visit the cBioPortal datasets page which contains a list of all the available studies. While this list doesn't contain a column with study IDs, the web links that link the studies in the "Name" column do contain the IDs. For instance, the study named "Breast Invasive Carcinoma (Broad, Nature 2012)" links the page https://www.cbioportal.org/study?id=brca_broad
- meaning the corresponding study ID is brca_broad
.
It is also possible to initialize the cBioPortal data source object without specifying cancer studies:
In this case, Cancermuts will gather mutations from all cancer studies available in cBioPortal.
Finally, we use this object to gather the mutations from the selected cancer studies and add them to our Sequence object we created earlier:
It is possible to downloaded metadata about the downloaded aminoacid mutations as well by using the metadata
argument of add_mutations
, which supports a list of strings, one for each metadata type. This is usually recommended. By default no metadata are added. Supported metadata for cBioPortal are:
cancer_type
: type of cancer the mutation was found in, depending on the studycancer_study
: cBioPortal study the mutation was found ingenomic_coordinates
: Genomic coordinates of the corresponding genomic mutationgenomic_mutations
: Genomic coordinates and base pair substitution of the corresponding genomic mutation
So for instance:
the gene or protein name is not among the arguments - this is because it is inferred from the seq
object. Only mutations of the corresponding gene will be annotated.
This step might take a few minutes as Cancermuts interrogates the cBioPortal online database. The final result of this step will have modified the seq
Sequence object we have created by adding mutations to the positions of the respective residues. In this case, Cancermuts identified only three mutations for cBioPortal:
Each mutation is recorded in a Mutation object that can be further explored:
The cancer type is not always present in the data downloaded from cBioPortal. In this cases, Cancermuts tries its best to infer it from the study name.
COSMIC
As before, we first create a COSMIC data source object:
here the database_files
argument is a string. If the user wants to use more than one database file, these should be provided as a list of strings. Usually, the argument for this file would be the COSMIC database file that was downloaded as detailed in the Install section.
Similarly, database_encoding
defines the text file encoding for every file (it is latin1
for COSMIC version 95).
As the default database file is rather large, we recommend running this step on a computer with at least 32 GB of free memory. Otherwise, it is possible to filter the database file first, for instnace keeping only the rows that contain the gene name of interest. In bash this can be done by running:
and then using the resulting file as the database file
We can then add mutations from the COSMIC data source:
Here we restrict the search to those mutations that are involved in colorectal adenocarcinoma.
It is also possible to search in any available cancer type or site as well (i.e. pancancer), by not specifying cancer types or sites in the add_mutations
call, for instance:
It should be noted that COSMIC supports up to four cancer histology types and four cancer histology subtypes and Cancermuts allows to filter by any of this. In particular, only the mutations that correspond to all the selected criteria are retained. Please see the COSMIC phenotype classification for the classification that COSMIC uses.
Furthermore, similarly to cBioPortal, we retain some metadata:
cancer_histology
: histology information on the cancer(s) in which this mutation was foundcancer_site
: site information on the cancer(s) in which this mutation was foundgenomic_coordinates
: Genomic coordinates of the corresponding genomic mutationgenomic_mutations
: Genomic coordinates and base pair substitution of the corresponding genomic mutation
In this case, filtering the database allows to add only a single mutation, which is K65E. It should be noted that the same amino acid substitution was already identified by cBioPortal. This means that the mutation object corresponding to this aminoacid substitution is annotated with metadata for this newfound mutation. We see now that the details obtained by both cBioPortal and COSMIC are present:
Additional mutation metadata
Cancermuts allows to enrich the downloaded mutations with further metadata. We will see now how to download the REVEL pathogenicy score and the gnomAD allele frequency for each specific variant.
REVEL score from MyVariant
We will download scores for the REVEL predictor of pathogenicity from the MyVariant.info database. REVEL is associated to genomic mutations and not to amino-acid mutations, therefore our amino acid mutations need to be annotated with the genomic mutations metadata in order for this to be possible. This is performed as detailed above. In order to download the associated REVEL score, we can use the appropriate data source class:
we can check that the mutations have been annotated with the REVEL score:
In this case we see the same score twice, as the mutation was previously annotated with two genomic mutations. The two genomic mutations corresponded to the same mutations annotated in two different genome assemblies, therefore the two scores we are able to gather have the same value.
gnomAD allele frequencies
Similarly, we annotate mutations with their exome or genome allele frequencies as found in the gnomAD database. It is also possible to annotate with the highest allele frequency found in non-bottlenecked populations (referred to as popmax). This works as you would expect by now:
here, we specify the version
argument to specify the version of gnomAD to be considered. Please refer to the API documentation for all the available versions.
The md_type
keyword allows to select which metadata type(s) to annotate, i.e. choose between exome or genome allele frequency (or both as in the example) and whether to annotate the popmax as well. Note, that they can all be chosen independently of each other.
We calculate the allele frequency as the ratio between the total allele count over the allele number as found in gnomAD, if the entry for the corresponding variant is available.
The popmax allele frequency is found by calculating the exome and/or genome allele frequency from all supported non-bottlenecked populations and then finding the maximum frequency of those.
Finally we can check the downloaded metadata:
here Cancermuts could assign an exome allele frequency for the associated genomic mutations, however it wasn't able to download or calculate the corresponding genome allele frequency.
Position properties
We will further annotate our sequence object with properties connected to each single residue. For the moment, these are post-translation modifications from Phosphosite and order/disorder propensity from MobiDB.
We import the relative data source class, similarly as what done previously:
Post-translational modifications with phosphosite
We will annotate post-translational modifications identified in experiments from the PhosphoSite Plus database. In order to do so we need a local copy of the Phosphosite dataset - please see the instructions in the installation guide.
We first create the Phosphosite data source object. We will need to supply the location of the database files:
by default, Cancermuts expect the file names in the database to be the default ones. However it's possible to specify the file name for each post-translational modifications by supplying the database_files
argument, which should be a dictionary associating each PTM to a file name:
the keys of the dictionaries should be the options supported in the properties
argument of the add_position_properties
(see below).
Once the object is created we can add the position properties to our sequence object. If no properties
is supplied, all of them will be considered. Otherwise, the properties
object should be a list of the keywords corresponding to the PTMs that we want to be annotated, as follows:
So, for instance:
However, in this case we will consider all of them:
Once again, we can check the result:
structured regions with MobiDB
We use the MobiDB website to annotate disordered regions in our protein of interest using the consensus of the entry. This works similarly as before:
We can then check the annotation as done previously:
There are four types of annotations, here listed in order of priority.
Disordered, curated
Based on manually curated data
Disordered, derived
Based on primary data, e.g. PDB structures
Disordered, homology
Based on homology inference
Disordered, predicted
Based on predictions
If none of these data are available for the consensus, the residue will not be annotated. For more information about the types of evidence, please see the MobiDB vocabulary
Sequence properties
We further annotate properties to the sequence, i.e. properties that cover multiple residues. We first import the respective classes:
Predictions of short linear motifs from ELM
We annotate the sequence using predictions for short linear motifs from the Eukaryotic Linear Motif database:
here exclude_elm_classes
is a single regular expression that allows to remove specific ELM classes by specifying a regular expression that matches one or more ELM class identifier. Here we use "MOD_."
to exclude post-translational modifications which are already supplied by PhosphoSite.
We can check the linear motifs we have collected:
and their specifics:
Alternatively, and in a very similar fashion, it is possible to use the gget Python package to obtain short linear motifs definitions. The current implementation only considers "regexp" type of predictions from the full protein sequence. It should be noted that, unlike when using the ELM webserver as detailed above to obtain these data, no filtering is applied. For instance:
This can be useful when e.g. the ELM webserver is not available or for when many calls in a row are necessary. The ELM webserver requires a minimum 3-minute interval between queries which has been baked into the current implementation of the ELM data source.
Custom annotations
We can further add annotations manually to our dataset. This is for data that is not available in the databases as of yet, but is useful to have annotated in the pipeline to have a complete picture. This can be done by using a custom CSV file as data source. The CSV can contain different types and levels of information. The file should have the following columns, separated by ;
:
The column format changes depending on the type
:
If we want to annotate a new aminoacid substitution, then
type has to be
mutation
name
can be any stringsite
needs to be a HGVS-format protein variant specification, e.g.p.Ala398Tyr
a new column,
genomic_mutations
, can optionally be added to annotate genomic mutations metadata corresponding to the protein mutation. For each mutation, we can specify one mor more space-separated single-nucleotide substitutions in the HGVS format, preposed by hg19 or hg38 depending on the reference genome assembly (hg19/hg38). For instance,hg38,17:g.7673776G>A
. To actually use the information from the 'genomic_mutations' column, thegenomic_mutations
metadata should also be specified when callingManualAnnotation.add_mutations
See the examples below.
If we want to annotate a post-translational modification, then
type
should be one ofptm_phosphorylation
,ptm_ubiquitination
,ptm_acetylation
,ptm_sumoylation
,ptm_nitrosylation
,ptm_methylation
name
can be any stringsite
needs to be a single number, e.g.34
. This is the residue number in the sequence (1-based) on which the PTM is foundfunction
can be any stringreference
can be any string
If we want to annotate a novel short linear motif, then
type
should belinear_motif
site
should be a dash-separated residue range, i.e.10-25
to signify from residue 10 to 25 in the aminoacid sequence, 1-based, including extremitiesfunction
can be any stringreference
can be any string
If we want to annotate a structured region, then
type
should bestructure
site
should be a dash-separated residue range, i.e.10-25
to signify from residue 10 to 25 in the aminoacid sequence, 1-based, including extremitiesfunction
can be any stringreference
can be any string
For instance, this is a working example of the csv file (named test.csv
):
If you do not want to include the genomic_mutations data, then structure the csv file as follows.
using the CSV file works as you would expect:
We can then verify:
Generating the final table
Once the Sequence object has been annotated with the desired data and metadata we can generate a summary table containing all the information. This is done using the Table module:
We can then manipulate the dataframe as we see fit, e.g. by saving it as a csv file:
Plotting the content of the final table
Finally, we can use Cancermuts to obtain a graphical representation of the final data frame:
the function returns the figure and axes Matplotlib
objects relative to the figure that is being created. It also saves the final output as the file indicated by the fname
argument in pdf format. The plot is similar to the one shown and described in our publication. Different aspects of the plot can be customized specifying different arguments, the major ones being:
the
section_size
arguments control how many protein sequence position are described by each section of the plot, meaning it controls how many sections are generated. The argument is the number of desired positions (e.g.section_size=50
)several data types can be hidden or shown by changing options to
False
(do not show) orTrue
(show; default for most cases). These aremutations
,elm
,ptms
,structure
,structure_mobidb
,mutations_revel
by default, only ELMs overlapping mutation sites are plotted - this can be changed by using argument
mutation_elms_only=False
option
figsize
accepts a tuple of number (width and height) and allows to change size and proportion of the output figure (e.g.figsize=(4,5)
)
Last updated