BioFolD

Mut2Dis Periodic Report 1: Outgoing phase - Stanford University

Project Acronym: Mut2Dis
Project Code: PIOF-GA-2009-237225
Project Title: New methods to evaluate the impact of single point protein mutation on human health.

Publishable summary

1. Summary of the project objectives
In this report we summarize the research activity performed by Dr. Emidio Capriotti during the outgoing phase of the Marie-Curie IOF at the Department of Bioengineering, Stanford University under the supervision of Dr. Russ B. Altman.
The main aims of our proposal are the following:

i. Study and characterization of the rate of evolution of Single Nucleotide Polymorphisms and their effect in human disease.
ii. Study and characterization of the structural determinants of human disease.
iii. Development of new general machine learning methods for disease prediction.
iv. Development of disease-specific predictors.
v. Development of a World Wide Web server for predicting the likelihood that a SNP variant is associated with human disease.

These 5 aims correspond to 6 different tasks to be accomplished in 36 months. In the proposal's timeline, we planned to perform about 4 of the 6 tasks during the outgoing phase (24 months). In line with this plan, we achieved most of the first 3 objectives and part of the 4th and 5th. The remaining parts of specific aims 4 and 5 will be performed during the returning phase at the University of the Balearic Islands (Spain).

2. Description of the work performed since the beginning of the project
During the first months of the project, Dr. Emidio Capriotti (EC) selected a set of annotated missense Single Nucleotide Variants (mSNVs) from the SwissVar database. The dataset used in this work was downloaded at the end of October 2009. To select the subset of mSNVs for which the three-dimensional structure of the protein is known, EC implemented programs that automatically compare the sequences of the mutated proteins with the sequences of the proteins collected in the Protein Data Bank (PDB). In a second phase, EC performed an evolutionary analysis, calculating the selective pressure acting at the codon level using alignments between the human DNA sequences and their homologs in mammalian species. In the next steps, EC built machine learning-based approaches to predict the impact of mSNVs, evaluating the discriminative power of different features. These algorithms include features from sequence analysis, such as evolutionary and functional information, together with protein structure information. In the last period, a disease-specific method was developed to predict cancer-causing mSNVs.

3. Description of the achieved results
With the research activity performed during the outgoing phase, EC achieved most of the objectives described in our proposal. In particular, we defined a set of discriminative features derived from protein sequence profile and protein structure. This allowed us to develop a new machine learning-based method for the prediction of deleterious variants that takes as input information from protein sequence profile, protein function and protein structure. The improvement in prediction accuracy resulting from the use of structure information has been quantitatively estimated by comparing the structure-based method with a similar sequence-based tool.

4. Expected final results and their potential impact
After the returning phase we expect to have developed a user-friendly web server interface for the prediction of the effect of mSNVs. Currently, EC is implementing these web tools, which include both protein sequence and structure information. We believe that the use of structural information in the prediction of deleterious variants will be important for understanding disease mechanisms. In addition, the developed method will have an impact on personal genomics by enabling new hypotheses about the onset of genetic diseases. As a natural consequence of this work, we are planning to study the relationship between genetic variants and drug response. The application of newly developed tools in clinical settings will be important for the establishment of personalized medicine.

Complete version

Project objectives for the period

During the two years of the outgoing phase the first three aims of the project have been accomplished. These objectives correspond to the first four tasks. In detail, after collecting a set of manually curated human variants, we analyzed them using different kinds of evolutionary information obtained by aligning the protein sequence with a set of related proteins in other species. In the second period, we selected a subset of mutations for which structural information is available. By analyzing these mutations we identified important features that improve the prediction of deleterious variants. To accomplish the 3rd aim of the project, we developed different general machine learning-based approaches to discriminate between disease-related and neutral variants. After a testing procedure we showed that protein structure features increase the quality of the prediction and provide information about the mechanism of the disease. During the last months EC started to collect a manually curated dataset of mutations related to cancer and to characterize them from a functional point of view. In the next period, EC is planning to develop a disease-specific method for the prediction of cancer-causing variants using sequence information. For more details about the activity performed during the outgoing phase, please refer to the next section.

Work progress and achievements during the period

1. Progress towards objectives and details for each task
In this section we summarize the objectives achieved during the outgoing phase at Stanford University for each of the five aims described in our proposal.

1.1 Study and characterization of the rate of evolution of Single Nucleotide Polymorphisms and their effect in human disease.
To achieve the first objective of our proposal, Dr. Capriotti performed two different types of analysis. The evolution of the protein at the mutated position was studied considering both alignments of similar protein sequences and DNA alignments between homologous genes. Across different features, these data confirm that wild-type residues are more conserved at disease-related sites than at neutral sites, and that mutated residues appear more frequently in the multiple sequence alignment positions corresponding to neutral mutations than in those corresponding to disease-related mutations.
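As a minimal illustration of the alignment-based comparison described above, the following sketch computes how often the wild-type and the mutated residue occur in one column of a multiple sequence alignment. The function names and the toy alignment are hypothetical and are not taken from the project code; the actual features used in the project may be defined differently.

```python
from collections import Counter


def column_frequencies(alignment, position):
    """Return residue frequencies in one alignment column, ignoring gaps."""
    column = [seq[position] for seq in alignment if seq[position] != "-"]
    total = len(column)
    if total == 0:
        return {}
    counts = Counter(column)
    return {residue: n / total for residue, n in counts.items()}


def wt_and_mutant_frequency(alignment, position, wild_type, mutant):
    """Frequency of the wild-type and mutant residues at a given column."""
    freqs = column_frequencies(alignment, position)
    return freqs.get(wild_type, 0.0), freqs.get(mutant, 0.0)


# Toy alignment (illustrative only): four aligned sequences of equal length.
toy_alignment = ["MKTA", "MKSA", "MRTA", "MK-A"]
wt_freq, mut_freq = wt_and_mutant_frequency(toy_alignment, 2, "T", "S")
print(f"wild-type frequency: {wt_freq:.2f}, mutant frequency: {mut_freq:.2f}")
```

Under the trend reported above, disease-related sites would tend to show a high wild-type frequency and a low mutant frequency, while neutral sites would more often tolerate the mutant residue.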

1.2 Study and characterization of the structural determinants of human disease.
After the characterization of mSNVs in terms of conservation and evolution across species, we analyzed structural features of the mutated residues. The first goal in this direction was the development of an automatic method to map mSNVs from protein sequence to structure. The protein structures and the mutated sites in the selected dataset were then analyzed to find new discriminative features. First we calculated the distributions of the relative solvent accessible area (RASA) for disease-related and neutral mSNVs, and their occurrences in each secondary structure class. The results show a significant difference between the RASA distributions of disease-related and neutral variants. In more detail, disease-related mutations are more likely to occur in the core of the protein and neutral ones on the surface.
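The RASA comparison can be sketched as follows, assuming that RASA is the absolute solvent accessible area of a residue divided by a reference maximum for that residue type. The reference maximum, the burial cutoff of 0.2 and the toy values are assumptions made for illustration; they are not values reported by the project.

```python
def relative_accessibility(asa, max_asa):
    """Relative solvent accessible area: observed ASA over a reference maximum."""
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    # Cap at 1.0 because observed ASA can slightly exceed the reference value.
    return min(asa / max_asa, 1.0)


def fraction_buried(rasa_values, cutoff=0.2):
    """Fraction of residues with RASA below the cutoff (hypothetical cutoff)."""
    if not rasa_values:
        return 0.0
    return sum(1 for value in rasa_values if value < cutoff) / len(rasa_values)


# Toy RASA values (illustrative only, not project data).
disease_rasa = [0.05, 0.10, 0.15, 0.60]
neutral_rasa = [0.10, 0.45, 0.70, 0.90]
print("buried fraction, disease-related:", fraction_buried(disease_rasa))
print("buried fraction, neutral:", fraction_buried(neutral_rasa))
```

A higher buried fraction among disease-related variants than among neutral ones would correspond to the core versus surface trend described above.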

1.3 Development of new general machine learning methods for disease prediction.
EC developed a sequence-based method. This algorithm is a Support Vector Machine (SVM) trained on data extracted from the protein sequence, the sequence profile, the output of PANTHER and a functional score derived from the Gene Ontology terms of the mutated protein and all their parent terms. The sequence-based SVM takes as input a 51-element vector. A second SVM-based method has been developed that also includes structural information. The sequence-based algorithm has thus been used as a baseline to quantify the improvement in prediction resulting from the use of structural information.
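One way to quantify the improvement of a structure-based predictor over a sequence-based baseline is to compare standard binary classification scores on the same test set. The sketch below uses overall accuracy and the Matthews correlation coefficient; the choice of these two scores, the function names and the toy labels are assumptions for illustration and do not reproduce the project's evaluation or its results.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = disease, 0 = neutral)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of correctly classified variants."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 if it is undefined."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy labels and predictions (illustrative only, not project results).
labels = [1, 1, 1, 0, 0, 0]
sequence_based = [1, 0, 1, 0, 1, 0]
structure_based = [1, 1, 1, 0, 1, 0]
for name, predictions in [("sequence", sequence_based), ("structure", structure_based)]:
    print(name, round(accuracy(labels, predictions), 3), round(mcc(labels, predictions), 3))
```

Evaluating both predictors on identical variants keeps the comparison fair, since any difference in the scores then reflects the added features rather than a different test set.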

1.4 Development of disease-specific predictors
For this particular task EC is focusing on the detection of mutations involved in the onset of different types of cancer. To this end, a set of manually curated driver variants associated with cancer has been selected.

2. Researcher training activities/transfer of knowledge activities/integration activities
During the outgoing phase at Stanford University, EC was appointed as a postdoc in the Department of Bioengineering. He had the opportunity to attend different courses. In addition, he collaborated on specific projects with researchers at the PharmGKB consortium, who are interested in analyzing the effect of genetic variants on drug response.

3. Highlight significant results
During the first phase, EC achieved many significant results related to the main objectives of the Mut2Dis project. First of all, EC analyzed a large dataset of annotated mutations, evaluating evolutionary and structural information. A more practical result is the development of a new machine learning-based approach that uses protein structure information to predict the effect of missense single nucleotide variants.

4. Statement on the use of resources
For the development of this project during the last two years, the University of the Balearic Islands had total expenses of 156,475.14 € (see table in the attached file). In the Form C attached to this report we included only the amount provided by the European Community (total 152,627.84 €).

Additional information

None

Dissemination activities

During the outgoing period EC dedicated part of his time to disseminating the results of this project at international conferences, at workshops and in invited seminars at institutions in both the US and Europe. To summarize the dissemination activity of the last two years, EC published one paper about the results obtained in this study (Capriotti and Altman, BMC Bioinformatics, 2011) and one review about the bioinformatics challenges for personalized medicine (Fernald et al., Bioinformatics, 2011). He also submitted 10 posters to meetings and conferences, 6 of which were selected for oral presentation. Finally, EC was invited to give 12 seminars in which he presented the results of the Mut2Dis research project. EC is also maintaining a web page where details of the project are made available. In the returning phase, EC is planning to attend important meetings and conferences to advertise the web server for the prediction of deleterious variants that will be developed during the next year. In addition, other papers and reviews are in preparation, and we believe they will be published during the second phase.

Project management

1. Project planning and status - from management point of view
During the outgoing phase all the aims and tasks of the project have been fulfilled according to the timeline described in the research proposal. From the management point of view, the available funds have been used to cover medical insurance expenses for EC during his stay in the US and the expenses of applying for the US visa. Part of the funds has been used to compensate the UIB personnel involved in the management of this grant. No medical insurance expenses are expected during the returning phase.

2. Problems which have occurred and how they were solved or envisaged solutions
None

3. Changes to the legal status of any of the beneficiaries
None

4. Impact of possible deviations from the planned milestones and deliverables
None

5. Development of the project website
EC, as beneficiary of the fellowship, is maintaining and updating a dedicated web site where details and information about the project are reported (see http://snps.uib.es/mut2dis).

6. Gender issues; Ethical issues
None

7. Justification of subcontracting (if applicable)
There are no subcontracting expenses in this period.

8. Justification of real costs (management costs)
The management costs include:
i. Management of the contract between the beneficiary and the University of the Balearic Islands (3,004.20 Euro).
ii. Travel insurance fees for the beneficiary for two years (3,032.37 Euro).
iii. Costs of management of the US visa for the beneficiary and fees (254.21 Euro).

9. Indirect costs
Overheads granted are 10% of the total direct costs (excluding management costs). The actual overheads at UIB used in FP7 are calculated using a simplified method, which includes all the indirect costs of the institution (communication costs, maintenance and depreciation of buildings and infrastructure, courier services, security services, electric power and water expenses, research support personnel and so on) and represents, for the year 2010, a rate of 81.01% of the personnel costs. This rate has already been audited. Only the maximum reimbursable indirect costs have been included in Form C, because in this Grant Agreement indirect costs are calculated at a flat rate.
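The arithmetic behind the management costs in section 8 and the flat-rate overhead rule can be checked with a short sketch. The three management cost items are taken from this report; the function names are hypothetical, and the direct cost amount passed to the flat-rate function is left to the caller because this report does not state it.

```python
from decimal import Decimal, ROUND_HALF_UP

# Management cost items as listed in section 8 of this report (Euro).
MANAGEMENT_COSTS_EUR = {
    "contract management": Decimal("3004.20"),
    "travel insurance (two years)": Decimal("3032.37"),
    "US visa management and fees": Decimal("254.21"),
}


def total_management_costs(items=MANAGEMENT_COSTS_EUR):
    """Sum the listed management cost items exactly, using decimal arithmetic."""
    return sum(items.values(), Decimal("0"))


def flat_rate_overhead(direct_costs_excl_management, rate=Decimal("0.10")):
    """Overhead at a flat rate of the direct costs, rounded to the cent."""
    amount = direct_costs_excl_management * rate
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print("Total management costs (Euro):", total_management_costs())
```

Decimal arithmetic is used here instead of floating point so that sums of currency amounts stay exact to the cent.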
