Here we will analyze a lung microbiome OTU dataset from a Sze et al. COPD study using some of Mian's basic features.
Let's start by examining the average phylum composition between the control and COPD samples. Go to Visualize > Barplot (Composition) and set the parameters as shown.
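For readers who want to see what the composition barplot summarizes, here is a minimal sketch of the underlying calculation: convert each sample's OTU counts to relative abundances, sum them by phylum, then average within each group. The sample names, counts, and taxonomy below are hypothetical toy values, not data from the Sze et al. study, and Mian's own implementation may differ.

```python
from collections import defaultdict

# Hypothetical toy data: raw OTU counts per sample (not from the COPD study).
otu_counts = {
    "S1": {"Otu1": 30, "Otu2": 50, "Otu3": 20},
    "S2": {"Otu1": 10, "Otu2": 70, "Otu3": 20},
    "S3": {"Otu1": 60, "Otu2": 20, "Otu3": 20},
    "S4": {"Otu1": 50, "Otu2": 30, "Otu3": 20},
}
# Hypothetical taxonomy and metadata tables.
otu_phylum = {"Otu1": "Firmicutes", "Otu2": "Bacteroidetes", "Otu3": "Proteobacteria"}
sample_group = {"S1": "Control", "S2": "Control", "S3": "COPD", "S4": "COPD"}


def phylum_relative_abundance(counts, taxonomy):
    """Collapse one sample's OTU counts into per-phylum fractions that sum to 1."""
    total = sum(counts.values())
    by_phylum = defaultdict(float)
    for otu, n in counts.items():
        by_phylum[taxonomy[otu]] += n / total
    return dict(by_phylum)


def mean_composition(counts_by_sample, taxonomy, groups):
    """Average the per-sample phylum fractions within each group."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for sample, counts in counts_by_sample.items():
        group = groups[sample]
        sizes[group] += 1
        for phylum, frac in phylum_relative_abundance(counts, taxonomy).items():
            sums[group][phylum] += frac
    return {
        group: {phylum: s / sizes[group] for phylum, s in phyla.items()}
        for group, phyla in sums.items()
    }


composition = mean_composition(otu_counts, otu_phylum, sample_group)
for group, phyla in composition.items():
    print(group, {phylum: round(frac, 3) for phylum, frac in phyla.items()})
```

Averaging the per-sample fractions (rather than pooling raw counts) keeps deeply sequenced samples from dominating the group mean.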
Bacteroidetes appear more enriched in the control samples, so let's build a boxplot to examine this in more detail.
The boxplot shows that this difference is significant under the Wilcoxon Rank-Sum test.
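To make the test behind the boxplot concrete, the sketch below implements a two-sided Wilcoxon rank-sum test using the normal approximation. It is a simplified illustration: it applies no tie or continuity correction, the abundance values are made up, and Mian may compute the statistic differently.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first group
    mean_w = n1 * (n1 + n2 + 1) / 2          # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical Bacteroidetes relative abundances (not the study's values).
control = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69]
copd = [0.31, 0.28, 0.40, 0.35, 0.25, 0.38]

z, p = rank_sum_test(control, copd)
print(f"z = {z:.3f}, p = {p:.4f}")
```

Because the test works on ranks rather than raw values, it does not assume the abundances are normally distributed, which suits skewed microbiome data.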
Next, let's look at the alpha diversity measures. Here, we'll compare species richness to the surface area measure, as proposed in the original paper.
Decreasing surface area appears to be correlated with decreasing species richness.
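As a rough illustration of this comparison, the sketch below computes species richness as the number of OTUs with a nonzero count in each sample and then measures its Pearson correlation with a per-sample surface area value. The counts and surface area numbers are invented for the example, and the choice of Pearson correlation is an assumption; the tool may report a different statistic.

```python
import math


def richness(counts):
    """Species richness: how many OTUs were observed at least once."""
    return sum(1 for n in counts if n > 0)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical samples: (OTU count vector, surface area measurement).
samples = [
    ([5, 0, 3, 0, 0, 1], 2.0),
    ([4, 2, 0, 1, 0, 0], 3.5),
    ([3, 1, 2, 2, 0, 1], 5.0),
    ([1, 1, 1, 1, 1, 1], 6.5),
    ([7, 0, 0, 0, 0, 2], 1.0),
]

richness_values = [richness(counts) for counts, _ in samples]
areas = [area for _, area in samples]
r = pearson_r(areas, richness_values)

print("richness per sample:", richness_values)
print(f"Pearson r between surface area and richness: {r:.3f}")
```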
What about beta diversity? Note that this tool might take a little longer to run because of the permutations. On especially large datasets, the tool might time out due to resource sharing. If this happens, consider deploying a local instance on a larger host type.
Here, we see that there are significant differences between the COPD and control lung bacteria communities.
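The text above does not say which test produces this result, so the following is only a sketch of one common approach: a PERMANOVA-style permutation test on Bray-Curtis distances. The distance metric, the `permanova` helper, and the toy profiles are all assumptions made for illustration. It also shows why the permutations make this step slower: the pseudo-F statistic is recomputed for every shuffle of the group labels.

```python
import random
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def pseudo_f(dist, labels):
    """Pseudo-F statistic from a distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for g in groups:
        members = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(samples, labels, n_perm=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    n = len(samples)
    dist = [[bray_curtis(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    observed = pseudo_f(dist, labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between labels and samples
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    # Add 1 so the observed labelling counts as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical OTU count profiles (not the study's data).
profiles = [
    [40, 30, 20, 10], [38, 32, 18, 12], [42, 28, 22, 8], [36, 34, 21, 9], [39, 31, 19, 11],
    [10, 20, 30, 40], [12, 18, 32, 38], [8, 22, 28, 42], [14, 16, 34, 36], [11, 19, 31, 39],
]
labels = ["Control"] * 5 + ["COPD"] * 5

f_stat, p_value = permanova(profiles, labels)
print(f"pseudo-F = {f_stat:.2f}, p = {p_value:.3f}")
```

The smallest p-value this procedure can report is 1 / (n_perm + 1), so more permutations give finer resolution at the cost of longer run time.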
Because we can, let's also build a 3D PCA plot to visualize how the COPD and control samples group. Go ahead and rotate it!
If you think the PCA above is a cool finding, you can take a snapshot of this view and save it to your notebook. Just press the "Save Snapshot to Notebook" button at the top right. If we go to our notebook (accessible from our home dashboard), we'll see the following:
Your notebook keeps track of all the parameters you used so if you click "View Original Source", you'll get back to the original plot.
You can also download or share your work. Try clicking on the "Share" button.
You can share this unique link with others and they can open the exact same view and start exploring the data. You can disable access at any time.
Now suppose we want to look for important OTUs. Let's use Boruta feature selection to see which OTUs are picked up.
Clicking on any of the OTUs leads to the boxplot for further visualization. For instance, here, let's click on Otu0009.
Indeed, Otu0009 looks important: the control samples are highly enriched for it.
Could we build a classifier to predict COPD status? Let's find out by training a simple logistic regression model on the COPD data.
The model achieves a test accuracy of 0.9375, which indicates that it can distinguish the two groups quite easily.
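To show what "training a simple logistic regression model" and "test accuracy" involve, here is a minimal sketch that fits a logistic regression with batch gradient descent on synthetic two-feature data and scores it on a held-out split. Everything here is hypothetical: the data is randomly generated rather than taken from the COPD study, and the learning rate, epoch count, and 60/20 split are arbitrary choices, not Mian's settings.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(X, y, lr=0.1, epochs=300):
    """Fit weights and bias by full-batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, xi):
    """Predict class 1 when the modelled probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0


def accuracy(w, b, X, y):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)


# Synthetic stand-in data: label 1 clusters around (2, 2), label 0 around (-2, -2).
rng = random.Random(42)
data = []
for _ in range(80):
    label = rng.randint(0, 1)
    center = 2.0 if label else -2.0
    data.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))

train, test = data[:60], data[60:]  # hold out the last 20 samples for testing
X_train, y_train = [x for x, _ in train], [lab for _, lab in train]
X_test, y_test = [x for x, _ in test], [lab for _, lab in test]

w, b = train_logistic(X_train, y_train)
train_acc = accuracy(w, b, X_train, y_train)
test_acc = accuracy(w, b, X_test, y_test)
print(f"train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")
```

Test accuracy is computed only on samples the model never saw during training, which is why it is the figure to quote. With a small held-out set, each misclassified sample shifts the score noticeably.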
We can also try a deep learning model. The COPD dataset is quite small, though, so a deep model might not generalize well. For this, let's use the larger Coral dataset.
With a two-layer feed-forward neural network with dropout, we can reach a test accuracy of 0.9711 after about 30 epochs.
We explored our dataset from top to bottom. We visualized composition differences, analyzed intra- and inter-community diversity, automatically selected the important OTUs, and trained a few promising machine learning models.
Many other tools are not covered here but are ready for you to use. Happy exploring!