FastRNA tutorial

This notebook gives an example on how to use FastRNA for single-cell RNA sequencing data. The data used in this tutorial can be found in the package github repo. The original data can be found here.

[1]:
# import basic libraries
import numpy as np
import pandas as pd
import scipy.io as io

from fastrna.core import fastrna_hvg, fastrna_pca

First, load the data matrix. This data should be in CSC format with float32 datatype. tocsc() converts the data to csc formant and astype(np.float32) converts the data to float32.

[2]:
mtx = io.mmread('datasets/mat.mtx').tocsc().astype(np.float32)
mtx.shape
[2]:
(33694, 9806)

The following code loads the metadata.

[3]:
meta = pd.read_csv('datasets/meta.csv', index_col=0)
meta.head() # Method coloumn contains the batch label
[3]:
nGene nUMI percent.mito Cluster CellType Experiment Method
pbmc1_10x_v2_A_AAAGATGCAAAGTCAA 851 2177 0.038126 5 CD14+ monocyte pbmc1 10x Chromium (v2) A
pbmc1_10x_v2_A_AAAGCAAGTAGGAGTC 1078 3065 0.041762 5 CD14+ monocyte pbmc1 10x Chromium (v2) A
pbmc1_10x_v2_A_AAAGCAATCGGTTCGG 538 977 0.099284 4 CD14+ monocyte pbmc1 10x Chromium (v2) A
pbmc1_10x_v2_A_AAAGTAGTCATTTGGG 1544 4933 0.042773 5 CD14+ monocyte pbmc1 10x Chromium (v2) A
pbmc1_10x_v2_A_AAAGTAGTCCGAGCCA 632 1487 0.047747 4 CD14+ monocyte pbmc1 10x Chromium (v2) A

Indexing is an essential component in data processing. The computational cost of indexing is relatively small when the data is small but grow significantly when the data is large. This is especially important for sparse matrices becaue of its special structure. Since FastRNA utilizes this specialized structure for efficient indexing of data, it requires reordering of data before analysis.

FastRNA input matrix should be ordered according to its batch label prior to analysis.

[4]:
batch_label = pd.factorize(meta["Method"])[0] # change the column name (Method) if required,
batch_label # the above line converts the batch label into an integer
[4]:
array([0, 0, 0, ..., 2, 2, 2])
[5]:
idx_sort = batch_label.argsort() # sort the index of batch_label in an ascending order
batch_label_sort = batch_label[idx_sort] # sort batch label
meta_sort = meta.iloc[idx_sort,:] # sort metadata
mtx_sort = mtx[:,idx_sort] # sort count matrix

Now, we can perform feature selection. fastrna_hvg calculates the normalized variance of the genes.

[6]:
%time gene_vars = fastrna_hvg(mtx_sort, batch_label_sort)
CPU times: user 2.86 s, sys: 184 ms, total: 3.05 s
Wall time: 111 ms

Normalized gene variance can be used for feature selection. We order the genes according to their normalized variances in a descending order.

[7]:
gene_idx_var = gene_vars.argsort()[::-1]
mtx_hvg = mtx_sort[gene_idx_var[:3000],:] # select 3000 features
mtx_hvg.sort_indices() # sort indices after gene selection, required due to sparse matrix structure
numi = np.asarray(mtx_sort.sum(axis=0)).ravel() # calculate size factor
%time eig_val, eig_vec, pca, rrt = fastrna_pca(mtx_hvg, numi, batch_label_sort)
CPU times: user 1min 3s, sys: 19 s, total: 1min 22s
Wall time: 1.89 s

pca contains the principal components of the cells.

[8]:
pca.shape, pca
[8]:
((9806, 50),
 array([[ -6.155175  , -32.147507  ,   1.9936739 , ...,  -0.53302026,
           0.14496931,  -0.11947128],
        [ -3.995697  , -16.785759  ,  -0.29182994, ...,  -0.37005997,
           0.338378  ,  -0.34952265],
        [-10.0068655 , -36.92015   ,  -1.3498954 , ...,   0.9612558 ,
          -0.42493236,  -0.29161602],
        ...,
        [-11.459299  , -57.605152  ,  -0.71217346, ...,   1.446156  ,
          -1.0820694 ,  -1.338346  ],
        [  1.5810945 ,   3.308975  ,   1.6145313 , ...,  -0.2889882 ,
           0.3467002 ,   0.14786568],
        [ -0.54217005,   6.396435  , -17.914013  , ...,  -0.28510135,
          -1.4869949 ,  -0.18762672]], dtype=float32))