How do GSEA and ssGSEA work?

GSEA

Sort all genes

The first step is to rank all genes by their log fold change, log(FC), between the two groups being compared.
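The ranking step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `rank_genes_by_log_fc`, the toy matrix, the use of log2 of the ratio of group means, and the `pseudocount` are all choices made here, not something the post prescribes.

```python
import numpy as np

def rank_genes_by_log_fc(expr, is_case, gene_names, pseudocount=1.0):
    """Rank genes from most up-regulated to most down-regulated.

    expr:       genes x samples matrix of non-negative expression values
    is_case:    one boolean per sample (True = case group, False = control)
    pseudocount is an assumption made here to avoid taking log2 of zero.
    """
    expr = np.asarray(expr, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    mean_case = expr[:, is_case].mean(axis=1)
    mean_ctrl = expr[:, ~is_case].mean(axis=1)
    log_fc = np.log2(mean_case + pseudocount) - np.log2(mean_ctrl + pseudocount)
    order = np.argsort(-log_fc, kind="stable")  # descending log(FC)
    return [gene_names[i] for i in order], log_fc[order]

# Toy data (hypothetical): 3 genes, 2 case samples followed by 2 controls.
expr = np.array([
    [10.0, 12.0, 1.0, 2.0],   # geneA: higher in cases
    [5.0, 5.0, 5.0, 5.0],     # geneB: unchanged
    [1.0, 1.0, 8.0, 10.0],    # geneC: higher in controls
])
is_case = [True, True, False, False]
ranked_genes, ranked_log_fc = rank_genes_by_log_fc(
    expr, is_case, ["geneA", "geneB", "geneC"]
)
```

The sorted gene names play the role of the list L in the next step, and the sorted log(FC) values are the per-gene metric used for weighting.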

Calculate the enrichment score

L: the ranked list of all genes

S: the targeted gene set

The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter genes not in S.


The enrichment score is the maximum deviation from zero encountered in the random walk.
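For readers who want the random walk written out, one common formulation follows the notation of the original GSEA paper. The symbols \(r_j\), \(p\), \(N\), \(N_H\) and \(N_R\) are not defined elsewhere in this post and are introduced here as assumptions.

\[
P_{hit}(S, i) = \sum_{g_j \in S,\; j \le i} \frac{|r_j|^p}{N_R},
\qquad
N_R = \sum_{g_j \in S} |r_j|^p
\]

\[
P_{miss}(S, i) = \sum_{g_j \notin S,\; j \le i} \frac{1}{N - N_H}
\]

\[
ES(S) = P_{hit}(S, i^*) - P_{miss}(S, i^*),
\qquad
i^* = \arg\max_i \left| P_{hit}(S, i) - P_{miss}(S, i) \right|
\]

Here \(r_j\) is the ranking metric (for example log(FC)) of the gene at position \(j\) in L, \(N\) is the number of genes in L, \(N_H\) is the number of genes in S, and \(p\) is a weighting exponent. With \(p = 0\) every gene in S contributes equally; with \(p = 1\) each contributes in proportion to its metric.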

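Note, however, that the steps are not symmetric: each gene in S increases the running sum by its own gene-specific weight (derived from its ranking metric), whereas every gene in L-S decreases it by the same fixed amount.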
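A minimal sketch of this weighted running sum is shown below. It assumes the list is already sorted, that the weight of a gene in S is the absolute value of its ranking metric raised to an exponent `p`, and that each gene outside S takes an equal step; the function and variable names are hypothetical.

```python
import numpy as np

def enrichment_score(ranked_genes, ranked_metric, gene_set, p=1.0):
    """Return (ES, running_sum) for an already-ranked gene list.

    ranked_genes:  gene names, sorted by the ranking metric (the list L)
    ranked_metric: metric values in the same order (e.g. log FC)
    gene_set:      set of gene names (the set S)
    """
    in_set = np.array([g in gene_set for g in ranked_genes])
    n_hit = int(in_set.sum())
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene_set must overlap L without covering all of it")

    # Genes in S: step up by a gene-specific weight, normalised to sum to 1.
    weights = np.abs(np.asarray(ranked_metric, dtype=float)) ** p
    hit_steps = np.where(in_set, weights, 0.0)
    if hit_steps.sum() == 0:
        raise ValueError("all genes in gene_set have a zero metric")
    hit_steps = hit_steps / hit_steps.sum()

    # Genes not in S: step down by the same amount, 1 / (number of misses).
    miss_steps = np.where(in_set, 0.0, 1.0 / n_miss)

    running_sum = np.cumsum(hit_steps - miss_steps)
    # ES is the running-sum value that deviates most from zero (sign kept).
    es = float(running_sum[np.argmax(np.abs(running_sum))])
    return es, running_sum

# Toy example (hypothetical genes): the two set members sit at the top of L.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es, running_sum = enrichment_score(genes, metric, {"g1", "g2"})
```

Because both kinds of step are normalised to sum to 1, the running sum always returns to zero at the end of the list, so only the shape of the walk in between determines the score.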

Estimation of Significance Level of ES

The significance of an observed ES is assessed by comparing it with the set of scores ESNULL computed with randomly assigned phenotypes.

  1. Randomly assign the original phenotype labels to samples, reorder genes, and re-compute ES(S).

  2. Repeat step 1 for 1,000 permutations, and create a histogram of the corresponding enrichment scores ESNULL.

  3. Estimate nominal P value for S from ESNULL by using the positive or negative portion of the distribution corresponding to the sign of the observed ES(S).
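The three steps above can be sketched as follows. This is an illustration under assumptions rather than a reference implementation: genes are ranked by a log2 fold change of group means with a pseudocount, the nominal P value is taken as the fraction of same-sign null scores at least as extreme as the observed one, and all function names and the simulated data are hypothetical.

```python
import numpy as np

def log_fc(expr, is_case, pseudocount=1.0):
    """Per-gene log2 fold change between case and control samples."""
    case_mean = expr[:, is_case].mean(axis=1)
    ctrl_mean = expr[:, ~is_case].mean(axis=1)
    return np.log2(case_mean + pseudocount) - np.log2(ctrl_mean + pseudocount)

def enrichment_score(metric, in_set, p=1.0):
    """Rank genes by metric (high to low) and return the signed ES.

    Assumes in_set marks some, but not all, of the genes.
    """
    order = np.argsort(-metric, kind="stable")
    hits = in_set[order]
    hit_w = (np.abs(metric[order]) ** p) * hits
    if hit_w.sum() == 0:
        return 0.0
    running = np.cumsum(hit_w / hit_w.sum() - (~hits) / (~hits).sum())
    return float(running[np.argmax(np.abs(running))])

def permutation_test(expr, is_case, in_set, n_perm=1000, seed=0):
    """Return (observed ES, nominal P value, null ES array)."""
    rng = np.random.default_rng(seed)
    observed = enrichment_score(log_fc(expr, is_case), in_set)
    # Steps 1 and 2: shuffle phenotype labels, re-rank genes, recompute ES.
    null = np.array([
        enrichment_score(log_fc(expr, rng.permutation(is_case)), in_set)
        for _ in range(n_perm)
    ])
    # Step 3: use only the part of the null with the same sign as the observed ES.
    same_sign = null[null >= 0] if observed >= 0 else null[null < 0]
    if same_sign.size == 0:
        return observed, float("nan"), null
    p_value = float(np.mean(np.abs(same_sign) >= abs(observed)))
    return observed, p_value, null

# Simulated data (hypothetical): 50 genes, 5 cases then 5 controls;
# the first 5 genes form the set S and are raised in the cases.
rng = np.random.default_rng(42)
expr = rng.uniform(5.0, 15.0, size=(50, 10))
is_case = np.array([True] * 5 + [False] * 5)
in_set = np.zeros(50, dtype=bool)
in_set[:5] = True
expr[:5, :5] += 20.0

observed_es, p_value, null_es = permutation_test(expr, is_case, in_set, n_perm=1000)
```

Shuffling the sample labels, rather than the gene labels, keeps the correlation structure between genes intact, which is why the genes have to be re-ranked inside every permutation.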

ssGSEA

The first difference is in the ranking: the genes are still sorted, but by their absolute expression within a single sample rather than by log(FC) between two groups.

The second difference is the weighting. For a given signature G of size \(N_G\) and a single sample S from a data set of N genes, the genes are replaced by their ranks according to their absolute expression, from high to low. Each signature gene is then weighted by its rank raised to an exponent α. Note that α is set to 1/4, which adds a modest weight to the rank.

The third difference is that the enrichment score ES(G,S) is obtained by summing (integrating) the difference between the weighted ECDF of the genes in the signature, \(P_G^w\), and the ECDF of the remaining genes, \(P_{N_G}\), rather than by taking the maximum deviation from zero.
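The three differences can be put together in a short sketch for one sample. It assumes that the most highly expressed gene receives the largest rank value (N) and the least expressed gene receives 1, a convention the text does not spell out; the function name and toy data are hypothetical.

```python
import numpy as np

def ssgsea_score(expression, gene_names, signature, alpha=0.25):
    """Single-sample enrichment score ES(G, S) for one signature.

    expression: expression values of one sample, one per gene
    signature:  set of gene names (the signature G)
    alpha:      exponent applied to the rank (1/4 in the text)
    """
    expression = np.asarray(expression, dtype=float)
    n = expression.size
    order = np.argsort(-expression, kind="stable")   # high to low expression
    # Assumption: top gene gets rank value N, bottom gene gets 1.
    rank_value = np.arange(n, 0, -1, dtype=float)
    in_sig = np.array([gene_names[i] in signature for i in order])
    n_sig = int(in_sig.sum())
    if n_sig == 0 or n_sig == n:
        raise ValueError("signature must overlap the genes without covering all")

    # Weighted ECDF of the signature genes, P_G^w.
    weights = (rank_value ** alpha) * in_sig
    p_sig = np.cumsum(weights) / weights.sum()
    # Unweighted ECDF of the remaining genes, P_{N_G}.
    p_rest = np.cumsum(~in_sig) / (n - n_sig)

    # Sum the difference over all positions instead of taking the maximum.
    return float(np.sum(p_sig - p_rest))

# Toy sample (hypothetical): six genes with decreasing expression.
gene_names = ["g1", "g2", "g3", "g4", "g5", "g6"]
sample = [9.0, 7.0, 5.0, 3.0, 2.0, 1.0]
score_top = ssgsea_score(sample, gene_names, {"g1", "g2"})
score_bottom = ssgsea_score(sample, gene_names, {"g5", "g6"})
```

A signature concentrated among the most highly expressed genes gives a positive score and one concentrated among the least expressed genes gives a negative score, and no second group of samples is needed to compute either.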

Reference