BioFolD

Mut2Dis Periodic Report 2: Returning phase - University of Balearic Islands

Project Acronym: Mut2Dis
Project Code: PIOF-GA-2009-237225
Project Title: New methods to evaluate the impact of single point protein mutation on human health.

Publishable summary

1. Summary of the project objectives
In this report we summarize the research activity performed by Dr. Emidio Capriotti during the returning phase of the Marie-Curie IOF at the Department of Mathematics and Computer Science, University of Balearic Islands under the supervision of Dr. Jairo Rocha.
The main aims of our proposal are the following:

i. Study and characterization of the rate of evolution of Single Nucleotide Polymorphisms and their effect in human disease.
ii. Study and characterization of the structural determinants of human disease.
iii. Development of new general machine learning methods for disease prediction.
iv. Development of disease-specific predictors.
v. Development of a World Wide Web server for predicting the likelihood of a SNP variant to be associated with human disease.

These 5 aims correspond to 6 different tasks to be accomplished in 36 months. According to the proposal's timeline, we planned to perform the final two tasks during the returning phase (12 months). Accordingly, we mainly accomplished the 4th and 5th aims. In conclusion, during the whole period of the project (36 months) we achieved all five objectives described in our proposal.

2. Description of the work performed since the beginning of the project
During the last 12 months of the project Dr. Emidio Capriotti (EC) developed a new method for the prediction of disease-specific mutations, focusing on cancer. In addition, he implemented two different web servers for the prediction of disease-related variants. In detail, EC selected a manually curated set of cancer driver missense Single Nucleotide Variants (mSNVs). This dataset, previously used to train another method (Carter et al., Cancer Research 2009), was analyzed by performing sequence analysis of the mutated proteins. For each protein the sequence profile was calculated using similar proteins retrieved with the BLAST algorithm. Using the sequence information thus obtained, EC developed a machine learning approach to discriminate between cancer-causing and neutral variants. For this particular task only sequence information was used, because the cancer mutations for which a protein three-dimensional structure was available were too few to train a machine learning method. Finally, EC implemented two web servers: the first, more general, for the prediction of disease-related mSNVs and the second, more specific, for the detection of cancer-causing mSNVs.

3. Description of the achieved results
The research activity performed during the returning phase reached all the aims described in our proposal. In particular, it has been demonstrated that for diseases with a sufficient number of annotated mutations it is possible to build disease-specific predictors. We tested this hypothesis in the case of cancer-causing mSNVs, showing that the disease-specific method performs better than the general method. In addition, we implemented a web-available version of the method that can be used by the scientific community to evaluate possibly deleterious mutations in humans.

4. Expected final results and their potential impact
At the end of the returning phase we have developed a user-friendly web server interface for the prediction of the effect of mSNVs. The implemented web tools include a general method for the detection of disease-related variants that uses both protein sequence and structure information, and a cancer-specific algorithm that takes into account only sequence information. In conclusion, we demonstrated that structural information is important to improve the prediction of deleterious variants. When structural information is not available but a good set of mutations has been annotated, functional information is important to improve the performance of the predictors on a specific class of diseases. We believe that in the near future, when more mSNV data become available, the development of disease-specific methods will be a key strategy for building more accurate algorithms and for understanding disease mechanisms.

Complete version

Project objectives for the period

During the last year, in the returning phase, the last two aims of the project have been accomplished. These objectives correspond to the last two tasks. In more detail, a set of manually curated cancer driver variants was selected and analyzed considering the evolutionary information derived from a set of related sequences retrieved by the BLAST algorithm. In addition, functional information from a reduced subset of Gene Ontology terms (GO slim) was used to characterize particular functions that are more frequent in proteins related to cancer. With this work we successfully accomplished the 4th aim of our proposal, developing a Support Vector Machine based method able to discriminate between cancer driver mSNVs and neutral polymorphisms. In the last period of the grant EC accomplished the 5th objective of the Mut2Dis project, developing different web servers for the prediction of the impact of deleterious mutations. For more details about the activity performed during the returning phase, please refer to the attached file in the next section.
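The evolutionary information mentioned above can be illustrated with a minimal sketch. It assumes that the BLAST hits have already been aligned to the query protein, and it derives three simple features for a mutated position: the frequency of the wild-type residue, the frequency of the mutant residue, and a conservation score. The function names, the entropy-based conservation measure and the toy alignment are assumptions made for illustration and are not taken from the project's implementation.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(aligned_seqs, position):
    """Relative frequency of each amino acid in one alignment column (0-based).

    Gaps and non-standard symbols are ignored.
    """
    residues = [s[position] for s in aligned_seqs if s[position] in AMINO_ACIDS]
    counts = Counter(residues)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: n / total for aa, n in counts.items()}


def conservation(freqs):
    """1 - normalized Shannon entropy (assumed measure).

    1.0 means a fully conserved column, 0.0 a uniform distribution
    over the 20 standard amino acids.
    """
    if not freqs:
        return 0.0
    entropy = -sum(p * math.log(p) for p in freqs.values() if p > 0)
    return 1.0 - entropy / math.log(len(AMINO_ACIDS))


def profile_features(aligned_seqs, position, wild_type, mutant):
    """Hypothetical feature set for one mSNV at the given alignment column."""
    freqs = column_frequencies(aligned_seqs, position)
    return {
        "wt_freq": freqs.get(wild_type, 0.0),
        "mut_freq": freqs.get(mutant, 0.0),
        "conservation": conservation(freqs),
    }


# Toy alignment (invented sequences, query first).
alignment = ["MKTA", "MKSA", "MRTA", "M-TA"]
features = profile_features(alignment, 1, wild_type="K", mutant="E")
```

In this toy case the mutant residue never appears in the column, which is the kind of signal a profile-based feature is meant to capture. A real pipeline would also need sequence weighting and pseudocounts, which are omitted here.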

Work progress and achievements during the period

1. Progress towards objectives and details for each task
In this section we summarize the objectives achieved for each of the last two aims described in our proposal during the returning phase at the University of Balearic Islands.

1.1 Development of disease-specific predictors.
For the accomplishment of the 5th task, EC started in the last part of the outgoing phase by collecting cancer-related mSNVs, selecting only mutations with disease names associated with the MeSH term “neoplasm”. During the returning phase this set was compared with a manually curated dataset of cancer driver mutations, to select a set of cancer-causing mutations and remove possible passenger cancer mSNVs that are not a direct cause of the pathological state. Using these data, EC compared the sequence profile at the mutated position between the set of cancer-causing mutations and an equally sized set of mSNVs in SwissVar not associated with any disease, which was used as the negative set. In the next step, the frequency of particular classes of protein function in the subset of cancer-causing mutations was compared with that in a similar set of randomly selected disease-related mSNVs not associated with cancer. Finally, the most discriminative features were selected and used to train and test a binary classifier able to discriminate between cancer-causing and non-cancer-causing mSNVs.
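The comparison of function frequencies described above can be sketched as a log-odds score per functional term. This is a minimal illustration under stated assumptions: the report does not give the scoring formula, so the log-odds form, the pseudocount, the function names and the placeholder term labels (TERM_A, TERM_B, TERM_C) are all hypothetical.

```python
import math
from collections import Counter


def term_log_odds(cancer_proteins, control_proteins, pseudocount=1.0):
    """Log-odds of observing each functional term in the cancer set
    versus the control set.

    Each argument is a list with one set of terms per protein.
    The pseudocount (an assumption) avoids division by zero for terms
    seen in only one of the two sets.
    """
    cancer_counts = Counter(t for terms in cancer_proteins for t in set(terms))
    control_counts = Counter(t for terms in control_proteins for t in set(terms))
    n_cancer = len(cancer_proteins)
    n_control = len(control_proteins)

    scores = {}
    for term in set(cancer_counts) | set(control_counts):
        p_cancer = (cancer_counts[term] + pseudocount) / (n_cancer + 2 * pseudocount)
        p_control = (control_counts[term] + pseudocount) / (n_control + 2 * pseudocount)
        scores[term] = math.log(p_cancer / p_control)
    return scores


def protein_score(terms, scores):
    """Sum of the log-odds of the terms annotated to one protein.

    Terms never seen in either set contribute 0.
    """
    return sum(scores.get(t, 0.0) for t in terms)


# Invented toy annotations with placeholder term labels.
cancer = [{"TERM_A", "TERM_B"}, {"TERM_A"}]
control = [{"TERM_B"}, {"TERM_C"}]
scores = term_log_odds(cancer, control)
```

A positive score marks a term that is more frequent among the cancer-related proteins, a negative score one that is more frequent in the control set. A per-protein score of this kind could then be supplied, together with the profile features, as input to a binary classifier.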

1.2 World Wide Web server for the disease-related mutation prediction.
During the last period of the returning phase, to accomplish the 6th task, EC implemented different web servers to make the methods developed in this project available to the scientific community. In detail, EC implemented an updated version of the SNPs&GO algorithm that predicts the effect of mSNVs using only sequence information. Following the findings of this research activity, a new version of the SNPs&GO algorithm that takes into account protein structure information (SNPs&GO3d) has been made available on the web. The SNPs&GO server and its implementation based on protein three-dimensional structure are reachable at http://snps.uib.es/snps-and-go. The promising results obtained in the analysis of cancer driver mutations have been used to implement a web server for predicting cancer-causing mSNVs (Dr. Cancer). The Dr. Cancer web server is available at http://snps.uib.es/drcancer.

2. Researcher training activities/transfer of knowledge activities/integration activities
During the returning phase at the University of Balearic Islands, EC was a contracted researcher in the Department of Mathematics and Computer Science. EC attended the Bologna Winter School 2012, a 5-day course dedicated to the study of proteins and their variants from the structural and functional point of view. He also had the opportunity to attend the Optimization course held by Dr. Jairo Rocha. There was also the opportunity to collaborate with other members of the Computational Biology and Bioinformatics Research Group on a statistical analysis for the detection of the highly discriminative sequence- and structure-based features included in our algorithms.

3. Highlight significant results
During the second phase, EC achieved many significant results related to the main aims of the Mut2Dis project. First of all, EC analyzed a large dataset of cancer-causing mSNVs, evaluating evolutionary and functional information to discriminate them from neutral polymorphisms and other disease-related mutations. The results have shown that residue conservation at the mutated site, derived from the protein sequence profile, is one of the best discriminative features. This finding has also been verified by comparing subsets of cancer-causing mSNVs and polymorphisms. We have also shown that cancer-specific GO scores are more accurate than general GO-term ones in the identification of cancer-related proteins, improving the detection of cancer-causing mSNVs. Finally, the new version of the SNPs&GO algorithm resulting from this research project has ranked among the best in its category both in tests performed by other groups (Thusberg et al., Human Mutation 2011) and on the blind set of mutations on CHK2 released by the Critical Assessment for Genome Interpretation (CAGI) organizers during the last two editions. The great interest of the international scientific community in our methods is shown by the geographic (http://snps.uib.es/) and the numeric (http://snps.uib.es/awstats/awstats.p) representations of the accesses to the http://snps.uib.es/ web server during the last few years.

4. Statement on the use of resources
For the development of this project during the returning phase, the University of Balearic Islands incurred total expenses of 64,976.67 € (see table in the attached file).

Additional information

More information about the servers and datasets used for testing our algorithms is available on the web pages of the servers.
More information about the results of the research activity is included in the Cancer Genomics paper published during the last year (see attachment).

Dissemination activities

During the returning period EC dedicated part of his time to disseminating the results of this project at international conferences, workshops and invited seminars at institutions in both the US and Europe. To summarize the dissemination activity performed during the last year, EC published one paper about the results obtained analyzing cancer-causing missense Single Nucleotide Variants (mSNVs) (Capriotti and Altman, Genomics, 2011), another paper about the prediction of the deleterious effect of mSNVs detected in a family quartet (Dewey et al., PLOS Genetics, 2012), and two reviews, one about the future perspectives in personal genomics (Capriotti et al., Briefings in Bioinformatics, 2012) and one about the use of protein structure information for the detection of mSNVs affecting drug response (Lahti et al., Journal of Royal Society Interface, 2012). In collaboration with other colleagues, EC also submitted 3 posters to meetings and conferences, 2 of which were presented orally by collaborators. Finally, EC was invited to give 4 seminars in which he presented the results of the Mut2Dis research project. EC is also maintaining a web page where details of the project are made available. Other papers and reviews related to this research project, currently in preparation, are expected to be published during the next few months.

Project management

1. Project planning and status - from management point of view
During the returning phase all the aims and tasks of the project have been fulfilled according to the timeline described in the research proposal. Dr. Jairo Rocha managed the scientific part of the project, supervising the research activity of Emidio Capriotti. Xavier Garcia supervised the financial part of the project, checking and keeping track of the budget.

2. Problems which have occurred and how they were solved or envisaged solutions
None

3. Changes to the legal status of any of the beneficiaries
None

4. Impact of possible deviations from the planned milestones and deliverables
None

5. Development of the project website
EC as beneficiary of the fellowship is maintaining and updating a dedicated web site where details and information about the project are reported (see http://snps.uib.es/mut2dis).

6. Gender issues; Ethical issues
None

7. Justification of subcontracting (if applicable)
There are no subcontracting expenses in this period.

8. Justification of real costs (management costs)
The management expenses consist of the management of the contract between the beneficiary and the University of Balearic Islands (2,204.70 €).

9. Indirect costs
Overheads granted are 10% of the total direct costs. The actual overheads at UIB used in FP7 are calculated using a simplified method, which includes all the indirect costs of the institution (communication costs, maintenance and depreciation of buildings and infrastructures, courier services, security services, electric power and water expenses, research support personnel and so on) and represent, for the year 2011, a rate of 81.09% of the personnel costs. This rate has already been audited.
