edgePy package¶

Subpackages¶

edgePy.data_import package

Submodules¶

edgePy.DGEList module¶

class edgePy.DGEList.DGEList(counts: Optional[numpy.ndarray] = None, samples: Optional[numpy.core.multiarray.array] = None, genes: Optional[numpy.core.multiarray.array] = None, norm_factors: Optional[numpy.core.multiarray.array] = None, groups_in_list: Optional[numpy.core.multiarray.array] = None, groups_in_dict: Optional[Dict] = None, to_remove_zeroes: Optional[bool] = False, filename: Optional[str] = None, current_transform_type: Optional[str] = None, current_log_status: Optional[bool] = False)[source]¶

Bases: object

Class containing read counts over genes for multiple samples and their corresponding metadata.

Parameters:

counts – Columns correspond to samples and rows to genes.
samples – Array of sample names, same length as ncol(counts).
genes – Array of gene names, same length as nrow(counts).
norm_factors – Weighting factors for each sample.
groups_in_list – A list giving the group to which each sample belongs, in the same order as samples. Provide either this or groups_in_dict.
groups_in_dict – A dictionary of groups, each mapping a group name to the sample names it contains.
to_remove_zeroes – If True, remove genes with zero counts in all samples.
filename – a shortcut to import NPZ (zipped numpy format) files.
current_transform_type – None means raw counts; if the counts have been transformed, a string naming the transform (e.g. ‘cpm’, ‘rpkm’).
current_log_status – True if counts has already been log transformed. Defaults to False.

Examples

>>> from edgePy.DGEList import DGEList
>>> from edgePy.data_import import get_dataset_path
>>> dataset = 'GSE49712_HTSeq.txt.gz'
>>> group_file = 'groups.json'
>>> DGEList.create_DGEList_data_file(get_dataset_path(dataset), get_dataset_path(group_file))
DGEList(num_samples=10, num_genes=21,711)
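The effect of to_remove_zeroes can be illustrated outside the class. The following is a minimal sketch, assuming the filter simply drops every row of counts whose entries are all zero and keeps the gene names aligned; the variable names are hypothetical and this is not taken from the edgePy source.

```python
import numpy as np

# Rows are genes, columns are samples, as in DGEList.counts.
counts = np.array([[5, 3, 0],
                   [0, 0, 0],
                   [2, 0, 7]])
genes = np.array(["geneA", "geneB", "geneC"])

# Assumption: a gene is removed only when every sample has a zero count.
keep = np.any(counts != 0, axis=1)

filtered_counts = counts[keep]
filtered_genes = genes[keep]
```

The same boolean mask is applied to both arrays so that row i of the filtered counts still corresponds to entry i of the filtered gene names.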

static _format_fields(fields: Iterable[Union[str, bytes]]) → Generator[[str, None], None][source]¶

Clean fields in the header of any read data.

Yields:	The next field that has been cleaned.

static _sample_group_dict(groups_list: List[str], samples: numpy.core.multiarray.array)[source]¶

Converts data in the form [‘group1’, ‘group1’, ‘group2’, ‘group2’] to the form {‘group1’: [‘sample1’, ‘sample2’], ‘group2’: [‘sample3’, ‘sample4’]}.

Parameters:	groups_list – group names in a list, in the same order as samples.
Returns:	dictionary containing the sample types, each with a list of samples.
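The list-to-dictionary conversion described here can be sketched in a few lines. This is an illustrative stand-alone function, not the library's implementation; the name sample_group_dict_sketch and the length check are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def sample_group_dict_sketch(
    groups_list: Sequence[str], samples: Sequence[str]
) -> Dict[str, List[str]]:
    """Map each group name to the samples that belong to it."""
    if len(groups_list) != len(samples):
        raise ValueError("groups_list and samples must have the same length")
    grouped: Dict[str, List[str]] = defaultdict(list)
    # Position i of groups_list names the group of samples[i].
    for group, sample in zip(groups_list, samples):
        grouped[group].append(sample)
    return dict(grouped)


groups = ["group1", "group1", "group2", "group2"]
samples = ["sample1", "sample2", "sample3", "sample4"]
result = sample_group_dict_sketch(groups, samples)
```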

static _sample_group_list(groups_dict, samples)[source]¶

Converts data in the form {‘group1’: [‘sample1’, ‘sample2’], ‘group2’: [‘sample3’, ‘sample4’]} to the form [‘group1’, ‘group1’, ‘group2’, ‘group2’].

Parameters:

groups_dict – dictionary containing the sample types, each with a list of samples.
samples – order of samples in the DGEList.

Returns:	data in a list, in the same order as samples.
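The reverse conversion can be sketched the same way: invert the dictionary into a sample-to-group lookup, then read the groups off in sample order. The function name is hypothetical and the sketch assumes every sample appears in exactly one group.

```python
from typing import Dict, List, Sequence


def sample_group_list_sketch(
    groups_dict: Dict[str, List[str]], samples: Sequence[str]
) -> List[str]:
    """Return the group of each sample, in the same order as samples."""
    sample_to_group = {}
    for group, members in groups_dict.items():
        for sample in members:
            sample_to_group[sample] = group
    # A KeyError here means a sample was not assigned to any group.
    return [sample_to_group[sample] for sample in samples]


groups_dict = {"group1": ["sample1", "sample2"], "group2": ["sample3", "sample4"]}
samples = ["sample1", "sample2", "sample3", "sample4"]
result = sample_group_list_sketch(groups_dict, samples)
```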

copy(counts: Optional[numpy.ndarray] = None, samples: Optional[numpy.core.multiarray.array] = None, genes: Optional[numpy.core.multiarray.array] = None, norm_factors: Optional[numpy.core.multiarray.array] = None, groups_in_list: Optional[numpy.core.multiarray.array] = None, groups_in_dict: Optional[Dict] = None, to_remove_zeroes: Optional[bool] = False, current_type: Optional[str] = None, current_log: Optional[bool] = False) → edgePy.DGEList.DGEList[source]¶

counts¶

The read counts for the genes in all samples.

Returns:	Columns correspond to samples and rows to genes.
Return type:	counts

cpm(transform_to_log: bool = False, prior_count: float = 0.25) → edgePy.DGEList.DGEList[source]¶: Normalize the DGEList to read counts per million.
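As a rough illustration of counts-per-million scaling, the sketch below divides each sample's counts by that sample's total and multiplies by one million. It works on a plain NumPy array rather than a DGEList, omits the transform_to_log and prior_count options, and is an assumption about the arithmetic rather than a copy of the method.

```python
import numpy as np


def counts_per_million_sketch(counts):
    """Scale each column (sample) so that its values sum to one million."""
    counts = np.asarray(counts, dtype=float)
    library_size = counts.sum(axis=0)  # total reads per sample
    return counts / library_size * 1e6


counts = np.array([[10, 0],
                   [30, 50],
                   [60, 50]])
cpm_values = counts_per_million_sketch(counts)
```

Because the division broadcasts along columns, each sample is scaled by its own library size.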

classmethod create_DGEList(sample_list: List[str], data_set: Dict[collections.abc.Hashable, Any], gene_list: List[str], sample_to_category: Optional[List[str]] = None, category_to_samples: Optional[Dict[collections.abc.Hashable, List[str]]] = None) → edgePy.DGEList.DGEList[source]¶: Create the DGEList object for future work. The sample list and gene list must be pre-sorted.

classmethod create_DGEList_data_file(data_file: pathlib.Path, group_file: pathlib.Path, **kwargs) → edgePy.DGEList.DGEList[source]¶

Wrapper for creating DGEList objects from file locations. Performs open and passes the file handles to the method for creating a DGEList object.

This function uses smart_open, which provides a broad list of data sources that can be opened. For a full list of data sources, see smart_open’s documentation at https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst

Parameters:

data_file – Text file defining the data set.
group_file – The JSON file defining the groups.
kwargs – Additional arguments supported by `np.genfromtxt`.

Returns:	Container for storing read counts for samples.
Return type:	DGEList

classmethod create_DGEList_handle(data_handle: _io.StringIO, group_handle: _io.StringIO, **kwargs) → edgePy.DGEList.DGEList[source]¶

Read in a file-like object of delimited data for instantiation.

Parameters:

data_handle – Text file defining the data set.
group_handle – The JSON file defining the groups.
kwargs – Additional arguments supported by np.genfromtxt.

Returns:	Container for storing read counts for samples.
Return type:	DGEList
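To show what reading delimited data and a JSON group file from handles might look like, here is a sketch that uses np.genfromtxt and json.load directly. The tab-delimited layout (a header row of sample names and a first column of gene names) and the shape of the JSON are assumptions made for illustration; the format edgePy expects may differ.

```python
import io
import json

import numpy as np

# Hypothetical inputs: a header row of sample names, one row per gene.
data_handle = io.StringIO("gene\ts1\ts2\ng1\t5\t0\ng2\t3\t7\n")
group_handle = io.StringIO('{"group1": ["s1"], "group2": ["s2"]}')

# Read everything as strings first, then split labels from counts.
table = np.genfromtxt(data_handle, delimiter="\t", dtype=str)
samples = table[0, 1:]
genes = table[1:, 0]
counts = table[1:, 1:].astype(int)

groups = json.load(group_handle)
```

Extra keyword arguments such as delimiter are the kind of option that kwargs would forward to np.genfromtxt.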

genes¶: Array of gene names.

get_gene_mask_and_lengths(gene_data)[source]¶: Use gene_data to get the gene lengths and a gene mask for the transformation.

Parameters:	gene_data – the object that holds gene data from Ensembl.

library_size¶

The total read counts per sample.

Returns:	The size of the library.
Return type:	library_size

log_transform(counts, prior_count)[source]¶: Compute the log of the counts
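The reference does not state the log base or exactly how prior_count is applied. The sketch below assumes the natural logarithm and treats prior_count as a floor on the counts (matching its description elsewhere on this page as a minimum value), so that zero counts do not produce negative infinity. Both choices are assumptions.

```python
import numpy as np


def log_transform_sketch(counts, prior_count=0.25):
    """Take the log of counts after raising small values to prior_count."""
    counts = np.asarray(counts, dtype=float)
    floored = np.maximum(counts, prior_count)  # assumption: prior_count is a floor
    return np.log(floored)  # assumption: natural log


values = log_transform_sketch([[0, 1], [10, 100]])
```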

read_npz_file(filename: str) → None[source]¶

Import a file stored in the DGE export format.

Parameters:	filename – the name of the file to read from.

rpkm(gene_data: edgePy.data_import.ensembl.ensembl_flat_file_reader.CanonicalDataStore, transform_to_log: bool = False, prior_count: float = 0.25) → edgePy.DGEList.DGEList[source]¶

Return the DGEList normalized to reads per kilobase of gene length per million reads: RPKM = numReads / (geneLength/1000 * totalNumReads/1,000,000).

Parameters:

gene_data – An object that imports Ensembl-based data, for use in calculations.
transform_to_log – True if you wish to convert to log after converting to RPKM.
prior_count – a minimum value for genes, if you do log transforms.
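The RPKM formula can be worked through on a small array. This sketch takes gene lengths as a plain array instead of looking them up through gene_data, and it skips the gene mask and log options, so it illustrates only the arithmetic under those assumptions.

```python
import numpy as np


def rpkm_sketch(counts, gene_lengths):
    """numReads / (geneLength / 1000 * totalNumReads / 1,000,000)."""
    counts = np.asarray(counts, dtype=float)
    gene_lengths = np.asarray(gene_lengths, dtype=float)
    total_reads = counts.sum(axis=0)             # one total per sample (column)
    per_kb = gene_lengths[:, np.newaxis] / 1e3   # one length per gene (row)
    per_million = total_reads / 1e6
    return counts / (per_kb * per_million)


counts = np.array([[100, 200],
                   [300, 800]])
gene_lengths = np.array([1000, 2000])
values = rpkm_sketch(counts, gene_lengths)
```

For the first gene in the first sample this gives 100 / (1 * 0.0004) = 250,000.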

samples¶: Array of sample names.

tpm(gene_lengths: numpy.ndarray, transform_to_log: bool = False, prior_count: float = 0.25, mean_fragment_lengths: numpy.ndarray = None) → edgePy.DGEList.DGEList[source]¶

Normalize the DGEList to transcripts per million.

Adapted from Wagner, et al. ‘Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples.’ doi:10.1007/s12064-012-0162-3

Read counts \(X_i\) (for each gene \(i\) with gene length \(\widetilde{l_i}\)) are normalized as follows:

\[TPM_i = \frac{X_i}{\widetilde{l_i}}\cdot \ \left(\frac{1}{\sum_j \frac{X_j}{\widetilde{l_j}}}\right) \cdot 10^6\]

Parameters:

gene_lengths – 1D array of gene lengths for each gene in the rows of DGEList.counts.
transform_to_log – store log outputs.
prior_count – a minimum value for genes, if you do log transforms.
mean_fragment_lengths – 1D array of mean fragment lengths for each sample in the columns of DGEList.counts (optional).
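The TPM formula above can be checked numerically. The sketch below uses the raw gene lengths as \(\widetilde{l_i}\) and ignores mean_fragment_lengths, which is a simplifying assumption; it also leaves out the log options.

```python
import numpy as np


def tpm_sketch(counts, gene_lengths):
    """TPM_i = (X_i / l_i) / sum_j (X_j / l_j) * 1e6, computed per sample."""
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(gene_lengths, dtype=float)[:, np.newaxis]
    rate = counts / lengths                  # reads per base, per gene
    return rate / rate.sum(axis=0) * 1e6     # normalize within each sample


counts = np.array([[100, 200],
                   [300, 800]])
gene_lengths = np.array([1000, 2000])
values = tpm_sketch(counts, gene_lengths)
```

Unlike RPKM, every column of the result sums to one million, which is the consistency property the cited paper argues for.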

write_npz_file(filename: str) → None[source]¶: Convert the object to a byte representation, which can be stored or imported.
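The layout of the exported file is not described here. As a general illustration of the NPZ format that write_npz_file and read_npz_file rely on, the sketch below saves a few named arrays with np.savez and loads them back; the key names counts, samples and genes are hypothetical and may not match what edgePy writes.

```python
import os
import tempfile

import numpy as np

counts = np.array([[1, 2], [3, 4]])
samples = np.array(["s1", "s2"])
genes = np.array(["g1", "g2"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.npz")
    # Each keyword becomes a named array inside the archive.
    np.savez(path, counts=counts, samples=samples, genes=genes)
    with np.load(path) as archive:
        loaded_counts = archive["counts"]
        loaded_samples = archive["samples"]
        loaded_genes = archive["genes"]
```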

Module contents¶