Genomic Data Analysis
Personalised Medicine - Exploratory Data Analysis
Introduction
Once sequenced, a cancer tumor can have thousands of genetic mutations. But the challenge is distinguishing the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers).
Currently, this interpretation of genetic mutations is done manually: a clinical pathologist has to review and classify every single genetic mutation based on evidence from text-based clinical literature, which is very time-consuming.
We have been challenged to automatically classify genetic mutations that contribute to cancer tumor growth (so-called “drivers”) in the presence of mutations that don’t affect the tumor (“passengers”).
The data comes in four files, two CSV files and two text files:
training/test variants: These are CSV catalogues of the gene mutations together with the target value Class, which is the (manually) classified assessment of the mutation. The feature variables are Gene, the specific gene in which the mutation took place, and Variation, the nature of the mutation. The test data, of course, doesn’t contain the Class values; those are what we have to predict. Each of these files is linked through an ID variable to a corresponding text file:
training/test text: These contain an extensive description of the evidence that was used (by experts) to manually label the mutation classes.
The text information holds the key to the classification problem and will have to be understood/modelled well to achieve a useful accuracy.
Load libraries and data files
Data Input Libraries
Data Wrangling/Manipulation libraries
Data Visualization Libraries
Reading the data into R
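A minimal sketch of one way to load everything (the file names follow the Kaggle data page; the read_txt helper is our own):

```r
library(tidyverse)  # readr, dplyr, tidyr, stringr, ggplot2, ...
library(tidytext)   # tidy text mining tools used later on

# The variant catalogues are regular csv files
train_variants <- read_csv("training_variants")
test_variants  <- read_csv("test_variants")

# The text files separate the ID from the clinical evidence text with "||"
read_txt <- function(path) {
  tibble(text = read_lines(path, skip = 1)) %>%
    separate(text, into = c("ID", "txt"), sep = "\\|\\|", extra = "merge") %>%
    mutate(ID = as.integer(ID))
}
train_text <- read_txt("training_text")
test_text  <- read_txt("test_text")
```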
Data Exploration
We start this EDA by exploring our target variable.
Target Variable
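A quick tally of the class counts (a sketch, assuming the train_variants table from above):

```r
train_variants %>%
  count(Class, sort = TRUE)
```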
Findings 1
- There are 9 unique classes, with class 7 being the most frequent and class 8 the least frequent.
- The class distribution is not uniform: classes 3, 8, and 9 have very low counts, classes 5 and 6 have medium counts, and classes 1, 2, 4, and 7 have relatively high counts.
Creating Data tables to explore the variants in the data
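One possible sketch of such a summary (summary_tbl is a hypothetical helper of ours):

```r
# Count unique IDs, Genes, and Variations, plus rows with missing values
summary_tbl <- function(df) {
  tibble(
    n_ids        = n_distinct(df$ID),
    n_genes      = n_distinct(df$Gene),
    n_variations = n_distinct(df$Variation),
    n_missing    = sum(!complete.cases(df))
  )
}
summary_tbl(train_variants)
summary_tbl(test_variants)
```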
Findings 2
- The training dataset comprises 3321 unique IDs with 264 different Genes and 2996 different Variations. Similarly, the test dataset comprises 5568 unique IDs with 1379 different Genes and 5628 different Variations.
- There are no missing values in the variants datasets.
- Some of the test data is machine-generated, which would explain the larger test dataset (compared to the training data).
- The most frequent Genes differ considerably between the train and test datasets, while the most frequent Variations are largely identical.
- A relatively small group of Gene levels makes up a sizeable part of the feature values in both train and test data. The test data has fewer high-frequency Genes.
Here we see how the Class target is distributed in the train data:
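A simple bar chart is enough here (a sketch):

```r
train_variants %>%
  ggplot(aes(x = factor(Class))) +
  geom_bar(fill = "steelblue") +
  labs(x = "Class", y = "Frequency")
```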
We find:
Class levels 3, 8, and 9 are notably under-represented
Levels 5 and 6 are of comparable, medium-low frequency
Levels 1, 2, and 4 are of comparable, medium-high frequency
Level 7 is clearly the most frequent one
Exploring Feature interactions
Now we want to examine how the features interact with each other and with the target Class variable.
Gene vs Class
First, we will look at the frequency distribution of the overall most frequent Genes for the different Classes. Note the logarithmic frequency scale.
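A sketch of how such a plot can be built (top_genes and the cutoff of 8 genes are our choices):

```r
# The overall most frequent Genes in the training data
top_genes <- train_variants %>%
  count(Gene) %>%
  slice_max(n, n = 8) %>%
  pull(Gene)

# Per-Class frequencies of these Genes on a logarithmic scale
train_variants %>%
  filter(Gene %in% top_genes) %>%
  ggplot(aes(x = Gene)) +
  geom_bar() +
  scale_y_log10() +
  facet_wrap(~ Class) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```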
We see immediately that there are significant differences:
Some Genes, like “PTEN”, are predominantly present in a single Class (here: 4).
Other Genes, like “TP53”, are mainly shared between 2 classes (here: 1 and 4).
Classes 8 and 9 contain none of the most frequent Genes.
Here’s what it looks like for the Classes sorted by Genes (again log counts):
This representation underlines our findings about the similar/dominating Genes in different Classes.
Gene vs Variation
Next, we are somewhat repurposing a count plot to visualise how the Variations are distributed for the most frequent Genes. Since there are so many different variations we drop the y-axis labels and merely illustrate how many Gene - Variation combinations exist in the data.
First the training data:
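Something along these lines (reusing the top_genes from above; geom_count sizes points by the number of co-occurrences):

```r
train_variants %>%
  filter(Gene %in% top_genes) %>%
  ggplot(aes(x = Gene, y = Variation)) +
  geom_count() +
  theme(axis.text.y  = element_blank(),   # too many Variations to label
        axis.ticks.y = element_blank(),
        axis.text.x  = element_text(angle = 45, hjust = 1))
```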
Then the test data:
Once more, the two data sets are rather heterogeneous in this view.
The text files
Overview
The second kind of data file contains a whole lot of text from what looks like scientific papers or proceedings. Here is the beginning of the first entry:
Sure enough, we can easily confirm that the first part of the complete entry corresponds to this paper and later switches to this one (and maybe other related ones). Therefore, this data file appears to be a data dump of the complete publication texts for the papers that the classification was based on (including figure captions, manuscript structure, and sometimes affiliations).
I’m suspecting that a little domain knowledge will go a long way here in determining which keywords are important and which ones aren’t. This will be an interesting exercise to see how clearly information is communicated in scientific publications.
On data cleaning and preparations
Here I want to collect various text features, artefacts, and global properties that I noticed during this initial exploration. This list will likely expand as the kernel grows.
Scientific terminology and stop words: Most scientific papers have a common style of language that will be reasonably homogeneous throughout the text files. Words like “result” or “discuss” will be frequent without necessarily containing any signal for our prediction goal. Therefore, below I define my own list of additional stop words.
Research field related stop words: My impression is that the list of stop words could be extended by including characteristic terms of the overall research field that are so ubiquitous that their high frequency may mask genuinely interesting terms. Words such as “mutation”, “cancer”, or “tumor” appear to be too general to have much distinguishing power here. The TF-IDF analysis below seems to confirm this. It would be interesting to get some feedback from people with domain knowledge about which other terms could a priori be removed from the text.
Paper notation quirks: Converting the paper text straight to ASCII leads to a number of artefacts. None of those will have a big impact individually, but together they might reduce the accuracy of the analysis:
- Citation numbers (as used e.g. by Nature magazine) are attached to the corresponding word
- Occasionally, there are what seem like webpage navigation commands, like “SectionNext”, embedded in the text
- Author names and affiliations are occasionally included
Feature Engineering
Text length - txt_len
For an early exploration we can look at the distribution of the length of the text features. A priori, I wouldn’t expect the length of a paper to be related to the classification outcome; but maybe some classifications require only a single paper while for others it’s necessary to check multiple ones.
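Adding the feature is a one-liner per data set (assuming the txt column from our input step above):

```r
train_text <- train_text %>% mutate(txt_len = str_length(txt))
test_text  <- test_text  %>% mutate(txt_len = str_length(txt))
```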
First, here is the overall distribution of the text entry lengths in train vs test:
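A sketch of the comparison via overlaid density estimates:

```r
bind_rows(train_text %>% mutate(set = "train"),
          test_text  %>% mutate(set = "test")) %>%
  ggplot(aes(txt_len, fill = set)) +
  geom_density(alpha = 0.5) +
  labs(x = "Text length [characters]", fill = "Data set")
```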
The difference in distribution shape might again be due to the machine-generated entries that have been added to the test sample.
Now, let’s see whether this distribution changes for the different target Classes. First, a facet wrap comparison:
Then an overlay of empirical cumulative density functions:
And the median lengths for each class:
We find:
There appear to be significant differences in the shape and median of the text length distributions. Classes 8 and 9 require on average more text, whereas Class 3 has the shortest/fewest papers associated with it.
For what it’s worth, it is tempting to speculate that the apparent multiple peaks in the text length distributions of the individual Classes could correspond to the number of papers that make up the clinical evidence.
Missing text values
In the discussion it was pointed out that a few observations have a “null” entry in their text features. Using our txt_len feature we can confirm this finding and easily show that there are no other text values with fewer than 100 characters (just in case a different null indicator had been used):
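The check itself is a simple filter (a sketch):

```r
# Only the literal "null" entries should survive this cut
train_text %>%
  filter(txt_len < 100) %>%
  select(ID, txt_len, txt)
```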
Keyword frequency - pedestrian approach
I want to use this competition to learn more about text mining. While I dive deeper into the applications of the various tools and techniques I will document here what I have learnt. If you are a beginner like me, then maybe this approach will be useful for you. If you are an expert then feel free to skip all the entry-level information (and maybe let me know if I get something seriously wrong.)
Before getting started with specialised tools, here is a first approach based on standard string manipulation methods.
An obvious first step in analysing the content of the clinical evidence is to look at how often certain keywords are mentioned in the text of the corresponding papers.
We choose the two words “pathogenic” and “benign” that are used in the naming of the 5 categories in this overview paper. Here we extract their frequency of occurrence per observation:
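Using stringr’s str_count, this could look as follows (train_keywords is our name):

```r
train_keywords <- train_text %>%
  mutate(n_pathogenic = str_count(txt, "pathogenic"),
         n_benign     = str_count(txt, "benign")) %>%
  left_join(train_variants %>% select(ID, Class), by = "ID")
```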
These are the frequency distributions of the word “pathogenic” for our 9 classes (note the logarithmic y-axes):
And here we plot the ratio of the mean occurrence per class of the word “pathogenic” over the mean occurrence of the word “benign”:
We find:
The facet plot shows that the word “pathogenic” is clearly more frequent in certain Classes such as 1, 4, or 5
The ratio plot confirms this impression and suggests two distinct groups of Classes: 2, 7, 8, 9 vs 1, 3, 4. The latter have on average a higher ratio of mentions of “pathogenic” over “benign” than the former. In addition, Classes 5 and 6 have an even higher ratio of “pathogenic” over “benign”.
Of course, some of these occurrences could have been “not pathogenic” or “not benign”, which is why we will need to dive further into text analysis to tackle this puzzle.
First steps into text analysis with tidytext
As the authors of the tidytext package put it, the tidy text format is defined as a table with one token per row, with a token being a word or another meaningful unit of text (paraphrased). Through tidy text we can use the powerful tools of the tidyverse to process and analyse text files. I will follow this excellent and free online book.
In order to get our text data in a tidy shape, we use the unnest_tokens tool. This also gets rid of punctuation and converts everything to lowercase:
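In its simplest form (t1 is our name for the tidy token table):

```r
t1 <- train_text %>%
  select(ID, txt) %>%
  unnest_tokens(word, txt)   # one lowercase word per row, punctuation removed
```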
The tidytext package contains a dictionary of stop words, like “and” or “next”, which we can remove from our tidy text data. In addition, we will define our own selection of stop words based on the typical structuring language of scientific papers. We also remove tokens that are only numbers or symbols.
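A sketch of this cleaning step (the contents of my_stopwords are illustrative, not exhaustive):

```r
my_stopwords <- tibble(word = c("figure", "fig", "table", "supplementary",
                                "results", "discussion", "methods", "et", "al"))

t1 <- t1 %>%
  anti_join(stop_words, by = "word") %>%    # tidytext's built-in dictionary
  anti_join(my_stopwords, by = "word") %>%  # our own additions
  filter(str_detect(word, "[a-z]"))         # drop tokens that are only numbers/symbols
```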
For a first overview, we have a look at the overall most popular words and their frequencies. This is our first serious application of tidyverse and ggplot2 tools to text data:
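For instance (the top-20 cutoff is our choice):

```r
t1 %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = NULL, y = "Word count")
```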
By and large, those are words that we would expect to find in a publication on cancer research and genetics. You will notice that for instance the top 4 words are essentially 2 variants of two basic words each. For our purposes these word variants are likely to obfuscate the signal we are interested in. We can reduce them to their basic meaning, their word stem, using a stemming tool.
As far as I can see, tidytext has currently no native stemming function. Therefore, we will use the “SnowballC” package and its “wordStem” tool:
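The stemming step itself is short (t1_stem is our name):

```r
library(SnowballC)

t1_stem <- t1 %>%
  mutate(word = wordStem(word))   # e.g. "mutations"/"mutation" both become "mutat"
```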
The result shows us the fundamental words that are most frequent in our overall text data. Another way of visualising these frequencies is through a wordcloud. Personally, I suspect that wordclouds might be the text equivalent of pie charts. But it’s useful to know how to incorporate them into tidy text analysis:
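The wordcloud package plugs into the tidy workflow via with():

```r
library(wordcloud)

t1_stem %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
```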
Class-dependent word frequencies
In order to use these word frequencies for prediction we first need to determine them for the individual Classes separately. Below, we join our “text” data with the Class information in the “variants” data set. Afterwards, we determine the relative frequency by Class of each word.
In this example, we will compare Class == 7, the most frequent one, with Classes 1 and 2. Also, we will only look at words with more than 1000 occurrences per Class to keep an overview. Here the ability to use dplyr tools starts to pay off properly:
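A sketch of this step (frequency and freq_wide are our names; pivot_wider puts the Classes side by side):

```r
frequency <- t1_stem %>%
  left_join(train_variants %>% select(ID, Class), by = "ID") %>%
  count(Class, word) %>%
  group_by(Class) %>%
  mutate(freq = n / sum(n)) %>%   # relative frequency within each Class
  ungroup()

freq_wide <- frequency %>%
  filter(n > 1000) %>%            # keep only high-frequency words
  select(Class, word, freq) %>%
  pivot_wider(names_from = Class, values_from = freq, names_prefix = "class_")
```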
Then, for a visual overview, we plot the frequency of the words in Class 7 against the other two Classes (note the logarithmic axes):
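A simplified sketch of one of these panels (without the colour scale described below; check_overlap thins out the labels):

```r
freq_wide %>%
  ggplot(aes(class_1, class_7)) +
  geom_point(alpha = 0.2, colour = "grey60") +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  geom_abline(linetype = 2) +   # dashed line of equal frequency
  scale_x_log10() +
  scale_y_log10()
```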
In these plots, words that are close to the dashed line (of equal frequency) have similar frequencies in the corresponding Classes. Words that are further along a particular Class axis (such as “inhibitor” for Class 7 vs 1) are more frequent in that Class. The blue-gray scale indicates how different the Class 7 frequency is from the overall frequency (with higher relative frequencies being lighter). The (slightly jittered) points in the background represent the complete set of (high-frequency) words, whereas the displayed words have been chosen to avoid overlap.
The plots give us a useful overview. For instance, they suggest that Classes 2 and 7 are more similar than 1 and 7. For a more systematic approach we compute the correlation coefficients for each frequency set (this time for the full lists, not just above 1000 occurrences):
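For example, for the full (unfiltered) frequency lists (freq_all is our name; cor.test drops word pairs that are missing in either Class):

```r
freq_all <- frequency %>%
  select(Class, word, freq) %>%
  pivot_wider(names_from = Class, values_from = freq, names_prefix = "class_")

cor.test(freq_all$class_7, freq_all$class_2)
cor.test(freq_all$class_7, freq_all$class_1)
```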
We find:
Classes 2 and 7 are in fact the most similar ones here, followed by 1 and 4 (correlation coefficients above 0.9)
Overall, the most different Class appears to be number 9, in particular compared to classes 3 and 5 (which are not overwhelmingly similar to each other). Let’s see what the word frequency spread looks like for those combinations:
We find:
There is significantly more scatter than in the previous set of plots, especially for Class 5 vs 9.
Interestingly, both “benign” and “pathogen” are more frequent in Class 3 vs 9.
TF-IDF analysis - basics and application
As the competition progresses you will probably see this combination of acronyms more and more often in kernels and discussion. And as a beginner like me, you might not know right away what it means. Let’s start with the basics:
TF stands for term frequency; essentially how often a word appears in the text. This is what we measured above. A list of stop-words can be used to filter out frequent words that likely have no impact on the question we want to answer (e.g. “and” or “the”). However, using stop words might not always be an elegant approach. IDF to the rescue.
IDF means inverse document frequency. Here, we give more emphasis to words that are rare within a collection of documents (which in our case means the entire text data).
Both measures can be combined into TF-IDF, a heuristic index telling us how characteristic a word is for a certain document (here: a certain Class) within a larger collection (here: all Classes). You can understand it as a normalisation of the relative term frequency by the overall document frequency. This will make words stand out that are characteristic for a specific Class, which is pretty much what we want to achieve in order to train a model.
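Schematically, for a term $t$ in a document $d$ out of $N$ documents (this is the standard definition that tidytext implements, using the natural logarithm):

$$\mathrm{tf\text{-}idf}(t,d) \;=\; \underbrace{\frac{n_{t,d}}{\sum_{t'} n_{t',d}}}_{\mathrm{tf}} \;\times\; \underbrace{\ln\frac{N}{\left|\{d : t \in d\}\right|}}_{\mathrm{idf}}$$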
Tidytext has the function bind_tf_idf to extract these metrics from a tidy data set that contains words and their counts per Class:
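Applied to our stemmed tokens (tf_idf_words is our name):

```r
tf_idf_words <- t1_stem %>%
  left_join(train_variants %>% select(ID, Class), by = "ID") %>%
  count(Class, word) %>%
  bind_tf_idf(word, Class, n)   # word = term, Class = document, n = count
```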
Let’s visualise the most characteristic words and their Class:
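A sketch:

```r
tf_idf_words %>%
  slice_max(tf_idf, n = 15) %>%
  ggplot(aes(x = reorder(word, tf_idf), y = tf_idf, fill = factor(Class))) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "tf-idf", fill = "Class")
```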
Well, that looks sufficiently technical I suppose. A quick Google search reveals that “dnmt3b7” is in fact an “aberrant splice form of a DNA methyltransferase, DNMT3B7, expressed in virtually all cancer cell lines but at very low levels in normal cells” (citation). Here it seems to be associated with Class 8.
Let’s have an overview of the most characteristic terms in each individual Class:
Again, very technical terms here. We notice, though, that some of them (like “brct”) occur in more than one class but still have a high tf-idf.
Word pair frequencies: n-grams
In a similar way as measuring the frequencies of individual words we can also study the properties of groups of words that occur together (like “statistical analysis”). This gives us an idea about the (typical) relationships between words in a certain document.
Tidytext, and other tools, use the concept of the n-gram, with n being the number of adjacent words we want to study as a group. For instance, a bigram is a pair of two words. We can extract all of those pairs in a very similar way as the individual words:
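The extraction only needs two extra arguments compared to single words (train_bigrams is our name):

```r
train_bigrams <- train_text %>%
  select(ID, txt) %>%
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)
```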
In order to filter out the stop words we need to separate the bigrams first, and then later unite them back together after the filtering. Separate/unite are also the names of the corresponding dplyr functions:
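A sketch of the separate - filter - unite sequence:

```r
bigrams_filtered <- train_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")
```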
Estimate tf-idf:
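Again via bind_tf_idf, now with bigrams as terms:

```r
bigram_tf_idf <- bigrams_filtered %>%
  left_join(train_variants %>% select(ID, Class), by = "ID") %>%
  count(Class, bigram) %>%
  bind_tf_idf(bigram, Class, n) %>%
  arrange(desc(tf_idf))
```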
And plot the bigrams per Class with the best tf-idf values:
Note that here we didn’t reduce similar words to their common stem, which leads to similar occurrences within Classes (e.g. “dnmt3b7 expression” and “dnmt3b7 expressing” in Class == 8). Still, by and large the contents of the Classes look sufficiently different to be useful for a prediction.
Networks of bigrams
Once we have the bigrams, i.e. sequences of adjacent words, we can also visualise their connections with other words by building a network. A network of words is a combination of connected nodes. Here we use the igraph package to build the network and the ggraph package to visualise it within the context of the tidyverse:
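A sketch of the overall network (the count cutoff of 250 is our choice for legibility):

```r
library(igraph)
library(ggraph)

set.seed(1234)   # the "fr" layout is randomised; fix it for reproducibility

bigram_graph <- bigrams_filtered %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n > 250) %>%
  graph_from_data_frame()

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n),   # rarer links are more transparent
                 arrow = grid::arrow(type = "closed", length = grid::unit(3, "mm"))) +
  geom_node_point(colour = "lightblue", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
```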
Maybe these networks are not so important for solving this particular problem. But they can give non-biologists like me more of an idea how the various technical concepts are connected.
Here the arrows show the direction of the word relation (e.g. “gene expression” rather than “expression gene”). Transparency is applied to these linking arrows according to the frequency of their occurrence (rarer ones are more transparent).
Individual Class networks
Let’s make the same network plots for the individual Classes to investigate their specific terms of importance. In order for this to work, we need to extract the bigram counts separately. For this, we build a short helper function that also gives us the flexibility to choose how many bigram combinations to display in the plot, as shown in the sketch below. The first parameter of the function is the number of the Class and the second is the lower count limit for the bigram word combinations.
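A sketch of such a helper (plot_bigram_net is our name; it reuses the objects defined above):

```r
plot_bigram_net <- function(cls, min_n) {
  bigrams_filtered %>%
    left_join(train_variants %>% select(ID, Class), by = "ID") %>%
    filter(Class == cls) %>%
    separate(bigram, into = c("word1", "word2"), sep = " ") %>%
    count(word1, word2, sort = TRUE) %>%
    filter(n > min_n) %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = n)) +
    geom_node_point(colour = "lightblue", size = 3) +
    geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
    theme_void()
}

plot_bigram_net(1, 20)   # e.g. Class 1, bigrams occurring more than 20 times
```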
These are the 9 different plots. We try to keep the network plots relatively sparse, so that we can see the important connections more clearly. Feel free to experiment with larger numbers of bigrams here.
In the following, I also note a few terms or combinations that appear characteristic to me. As an absolute non-expert, I will probably also note a few trivial terms that don’t relate to our challenge. As the competition goes on, I hope to pick up a few hints on how to clean our input data.
Class 1: We see the connections for “p53” and “brct”. We also find the bigram “tumor suppressor”.
Class 2: We see how “ba”, “f3”, and “3t3” relate to “cancer cells”.
Class 3: Here, “baf3” and “brca1” seem to be important. Maybe “tyrosine kinase” too.
Class 4: Here we have “brca1” and “brct” again, together with another prominent show of “tumor suppressor”.
Class 5: We’ve got “cisplatin sensitivity” and the network of “brca1”.
Class 6: Once more “tumor suppression” and also “e2 interaction”.
Class 7: Here, “egfr” seems to be important for “mutations” and several isolated bigrams can be spotted.
Class 8: Here we see relatively many connections of 3 terms, like “bcor” with “ccnb3” and “rara” or “gbm” with “adult” and “paediatric”.
Class 9: One of the denser networks here shows the relations that connect “idh1” and “u2af1”.
Thanks to Heads or Tails on Kaggle for the amazing kernel that inspired this work.