The ECN committee are pleased to announce the second webinar in our series on multivariate statistics, which will be delivered by Dr Claus Mayer:
Sparse multivariate methods and integration of omics data sets
Friday 6th September, 16:00-17:00 CET
Multivariate methods like principal component analysis (PCA) or partial least squares (PLS) are essential for revealing structure in high-dimensional omics-type data, where the number of variables is typically much larger than the number of samples (p>>n). As useful as these methods are for studying the relationships between samples, the high number of variables makes it hard to interpret which genes, proteins or metabolites contribute to the patterns we see. Sparse methods force the loadings of most variables to be 0 while still explaining much of the variation in the data, and thus enable an easier biological interpretation of the results. Dr Mayer will introduce sparse versions of some commonly used multivariate methods and illustrate their use in data examples.
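To give a flavour of what "sparse loadings" means, here is a minimal sketch in plain Python (standard library only, so it runs without extra packages). It is not the method from the webinar: it uses a simple soft-thresholded power iteration, and the function names, the `sparsity` tuning parameter and the simulated data (20 samples, 60 variables, of which only the first 3 carry signal) are all assumptions made for illustration. In practice one would use an established package rather than hand-written loops.

```python
import math
import random


def centre_columns(X):
    """Subtract each column mean so that every variable has mean 0."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]


def soft_threshold(values, threshold):
    """Shrink values towards 0; anything within +/- threshold becomes exactly 0."""
    return [math.copysign(max(abs(v) - threshold, 0.0), v) for v in values]


def normalise(vec):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def first_loading(X, sparsity=0.0, n_iter=200):
    """Leading loading vector of column-centred X by power iteration.

    `sparsity` is a hypothetical tuning parameter in [0, 1): after each step
    the entries are soft-thresholded at sparsity * (largest absolute entry).
    sparsity=0 gives the ordinary first principal component loadings.
    """
    n, p = len(X), len(X[0])
    v = normalise([1.0] * p)
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in X]   # X v
        grad = [sum(X[i][j] * scores[i] for i in range(n)) / n       # X'X v / n
                for j in range(p)]
        v = normalise(soft_threshold(grad, sparsity * max(abs(g) for g in grad)))
    return v


# Simulated p >> n data (an assumption for illustration): one latent factor z
# drives the first 3 variables, the remaining 57 are pure noise.
rng = random.Random(0)
n_samples, n_variables, n_signal = 20, 60, 3
X = []
for _ in range(n_samples):
    z = rng.gauss(0, 1)
    X.append([(2 * z if j < n_signal else 0.0) + 0.2 * rng.gauss(0, 1)
              for j in range(n_variables)])
X = centre_columns(X)

dense = first_loading(X, sparsity=0.0)    # ordinary PCA loadings
sparse = first_loading(X, sparsity=0.5)   # sparse loadings
dense_support = [j for j, w in enumerate(dense) if w != 0.0]
sparse_support = [j for j, w in enumerate(sparse) if w != 0.0]
print("variables with non-zero loading, ordinary PCA:", len(dense_support))
print("variables with non-zero loading, sparse version:", sparse_support)
```

With ordinary PCA every variable receives some non-zero loading, whereas the thresholded version should keep only the variables that actually carry the signal, which is what makes the component easier to interpret biologically.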
In the second part he will present methods that analyse two (or more) data sets simultaneously, such as Canonical Correlation Analysis (CCA) or Co-Inertia Analysis (CIA). These tools make it possible to study the joint influence of two sets of variables (e.g. a transcriptomic and a proteomic data set) on the variation within samples while showing the relationship between the data sets at the same time.
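As a rough illustration of the two-data-set idea, the sketch below finds the first pair of co-inertia axes, i.e. the weight vectors u and v that maximise the covariance between the sample scores X u and Y v. This amounts to the leading singular vectors of the cross-covariance matrix X'Y, computed here by a plain-Python power iteration. The function names and the simulated data (30 samples, 8 "transcript" and 6 "protein" variables, with the first two of each driven by a shared latent factor) are assumptions for illustration, and CCA itself would additionally standardise by the within-set covariances, which needs regularisation when p>>n.

```python
import math
import random


def centre_columns(M):
    """Subtract each column mean so that every variable has mean 0."""
    n = len(M)
    means = [sum(col) / n for col in zip(*M)]
    return [[x - m for x, m in zip(row, means)] for row in M]


def normalise(vec):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def mat_vec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * w for m, w in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def first_coinertia_axes(X, Y, n_iter=200):
    """Leading axes (u, v) maximising the covariance between X u and Y v.

    Power iteration on the cross-covariance matrix X'Y; the 1/n factor is
    dropped because the vectors are re-normalised at every step.
    """
    Xt, Yt = transpose(X), transpose(Y)
    v = normalise([1.0] * len(Y[0]))
    for _ in range(n_iter):
        u = normalise(mat_vec(Xt, mat_vec(Y, v)))   # u proportional to X'Y v
        v = normalise(mat_vec(Yt, mat_vec(X, u)))   # v proportional to Y'X u
    return u, v


def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


# Simulated data (an assumption for illustration): the same latent factor z
# drives the first two variables in both data sets; the rest is noise.
rng = random.Random(1)
X, Y = [], []
for _ in range(30):
    z = rng.gauss(0, 1)
    X.append([(z if j < 2 else 0.0) + 0.3 * rng.gauss(0, 1) for j in range(8)])
    Y.append([(z if k < 2 else 0.0) + 0.3 * rng.gauss(0, 1) for k in range(6)])
X, Y = centre_columns(X), centre_columns(Y)

u, v = first_coinertia_axes(X, Y)
scores_x, scores_y = mat_vec(X, u), mat_vec(Y, v)
top_x = sorted(sorted(range(len(u)), key=lambda j: -abs(u[j]))[:2])
top_y = sorted(sorted(range(len(v)), key=lambda k: -abs(v[k]))[:2])
score_correlation = pearson(scores_x, scores_y)
print("variables with the largest weights in X:", top_x)
print("variables with the largest weights in Y:", top_y)
print("correlation between the two sets of sample scores:", round(score_correlation, 2))
```

The two weight vectors show which variables in each data set drive the shared pattern, while the paired sample scores show how closely the two data sets agree on the ordering of the samples.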
Dr Claus Mayer is a senior statistician working for Biomathematics and Statistics Scotland. His main area of research in recent years has been the analysis of high-dimensional genomics data, with a particular emphasis on gene expression studies (microarrays, RNAseq) and related areas (proteomics, methylation studies). Dr Mayer has worked on methods for integrating and combining such omics data sets from different sources, such as combining high-dimensional data from different stages of an experiment in a group-sequential setting, conducting meta-analyses of comparable gene expression studies, or integrating different types of omics data collected from the same samples. Dr Mayer has also investigated ways of quickly calculating overall summary statistics of pairwise (cross-)correlations within one or more high-dimensional data sets, and has studied ways of turning such (partial) correlation structures into sparse, biologically interpretable networks.
Sign up for the webinar at:
The recorded webinar is now available at:
Extra Q and As from Claus:
What other omics integration tools have you tested on your datasets (DIABLO, MINT, sMBPLS...)?
This particular data set is more than 10 years old. I think Le Cao had published some of her papers by then, but the mixOmics package didn’t exist yet and neither did DIABLO and MINT. In a way, doing this seminar was one way for me to get into these packages a bit more. At that time we used the ade4 and made4 packages, which I think are forerunners of omicade4. I wasn’t aware of the sMBPLS package, but it looks very interesting, so I will check it out. I think there are slightly different cultures depending on which omics data you deal with. The metabolomics world seems very strong on variants of PLS, whereas the gene expression folk seem to love network analysis (which you can regard as multivariate analysis too, but which is very different).
Is there any rule for a maximum number of variables that can be assessed according to the number of samples (when the number of variables is huge but the number of subjects does not exceed a few hundred)?
No, I don’t think so. The number of variables will only ever grow in comparison to the number of samples and saying there is a maximum number would imply we (as statisticians) refuse to accept our responsibility to deal with that. Statistics has developed rapidly in this area in the last 20 years, and the best and most successful methods tend to make use of the high-dimensionality rather than seeing it as a curse.
A pdf copy of the webinar presentation is available in the attachment to this message for ECN members.