edgePy package

Submodules

edgePy.DGEList module

class edgePy.DGEList.DGEList(counts: Optional[numpy.ndarray] = None, samples: Optional[numpy.core.multiarray.array] = None, genes: Optional[numpy.core.multiarray.array] = None, norm_factors: Optional[numpy.core.multiarray.array] = None, groups_in_list: Optional[numpy.core.multiarray.array] = None, groups_in_dict: Optional[Dict] = None, to_remove_zeroes: Optional[bool] = False, filename: Optional[str] = None, current_transform_type: Optional[str] = None, current_log_status: Optional[bool] = False)[source]

Bases: object

Class containing read counts over genes for multiple samples and their corresponding metadata.

Parameters:
  • counts – Columns correspond to samples and rows to genes.
  • samples – Array of sample names, same length as ncol(counts).
  • genes – Array of gene names, same length as nrow(counts).
  • norm_factors – Weighting factors for each sample.
  • groups_in_list – a list of groups to which each sample belongs, in the same order as samples. Alternatively, supply groups_in_dict.
  • groups_in_dict – a dictionary of groups, mapping each group name to a list of its sample names.
  • to_remove_zeroes – To remove genes with zero counts for all samples.
  • filename – a shortcut to import NPZ (zipped numpy format) files.
  • current_transform_type – None means raw counts; otherwise, if transformed, store a string (e.g. ‘cpm’, ‘rpkm’, etc.).
  • current_log_status – If counts has already been log transformed, store True. Defaults to False.

Examples

>>> from edgePy.data_import import get_dataset_path
>>> dataset = 'GSE49712_HTSeq.txt.gz'
>>> group_file = 'groups.json'
>>> DGEList.create_DGEList_data_file(get_dataset_path(dataset), get_dataset_path(group_file))
DGEList(num_samples=10, num_genes=21,711)
static _format_fields(fields: Iterable[Union[str, bytes]]) → Generator[[str, None], None][source]

Clean fields in the header of any read data.

Yields:The next field that has been cleaned.
static _sample_group_dict(groups_list: List[str], samples: numpy.core.multiarray.array)[source]

Converts data in the form [‘group1’, ‘group1’, ‘group2’, ‘group2’] to the form {‘group1’: [‘sample1’, ‘sample2’], ‘group2’: [‘sample3’, ‘sample4’]}.

Parameters:groups_list – group names in a list, in the same order as samples.
Returns:dictionary containing the sample types, each with a list of samples.
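
The conversion described above can be sketched as a standalone function. This is a minimal illustration and not the edgePy implementation; the function name `sample_group_dict` and the length check are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def sample_group_dict(
    groups_list: Sequence[str], samples: Sequence[str]
) -> Dict[str, List[str]]:
    """Map each group name to the samples that belong to it (hypothetical helper)."""
    if len(groups_list) != len(samples):
        raise ValueError("groups_list and samples must have the same length")
    grouped: Dict[str, List[str]] = defaultdict(list)
    # Pair each sample with its group; order within a group follows `samples`.
    for group, sample in zip(groups_list, samples):
        grouped[group].append(sample)
    return dict(grouped)


groups = ["group1", "group1", "group2", "group2"]
samples = ["sample1", "sample2", "sample3", "sample4"]
result = sample_group_dict(groups, samples)
```

Here `result` maps ‘group1’ to the first two samples and ‘group2’ to the last two.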
static _sample_group_list(groups_dict, samples)[source]

Converts data in the form {‘group1’: [‘sample1’, ‘sample2’], ‘group2’: [‘sample3’, ‘sample4’]} to the form [‘group1’, ‘group1’, ‘group2’, ‘group2’].

Parameters:
  • groups_dict – dictionary containing the sample types, each with a list of samples.
  • samples – order of samples in the DGEList
Returns:

data in a list, in the same order as samples.
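
The inverse conversion can be sketched in the same way. This is an illustration only, not the edgePy implementation; the name `sample_group_list` and the error raised for unassigned samples are assumptions.

```python
from typing import Dict, List, Sequence


def sample_group_list(
    groups_dict: Dict[str, List[str]], samples: Sequence[str]
) -> List[str]:
    """Return the group of each sample, in the order given by `samples` (hypothetical helper)."""
    # Invert the mapping: sample name -> group name.
    sample_to_group = {
        sample: group
        for group, members in groups_dict.items()
        for sample in members
    }
    missing = [sample for sample in samples if sample not in sample_to_group]
    if missing:
        raise ValueError(f"samples without a group: {missing}")
    return [sample_to_group[sample] for sample in samples]


groups_dict = {"group1": ["sample1", "sample2"], "group2": ["sample3", "sample4"]}
ordered = sample_group_list(groups_dict, ["sample1", "sample2", "sample3", "sample4"])
```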

copy(counts: Optional[numpy.ndarray] = None, samples: Optional[numpy.core.multiarray.array] = None, genes: Optional[numpy.core.multiarray.array] = None, norm_factors: Optional[numpy.core.multiarray.array] = None, groups_in_list: Optional[numpy.core.multiarray.array] = None, groups_in_dict: Optional[Dict] = None, to_remove_zeroes: Optional[bool] = False, current_type: Optional[str] = None, current_log: Optional[bool] = False) → edgePy.DGEList.DGEList[source]
counts

The read counts for the genes in all samples.

Returns:Columns correspond to samples and rows to genes.
Return type:counts
cpm(transform_to_log: bool = False, prior_count: float = 0.25) → edgePy.DGEList.DGEList[source]

Normalize the DGEList to read counts per million.
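
As a rough illustration of the normalization, counts per million can be computed with NumPy by dividing each column by its library size (the column total) and scaling by 10^6. The function below is a sketch with a hypothetical name; it ignores norm_factors and log transformation, which the edgePy method may handle differently.

```python
import numpy as np


def counts_per_million(counts: np.ndarray) -> np.ndarray:
    """Scale each sample (column) so that its counts sum to one million."""
    counts = np.asarray(counts, dtype=float)
    # Library size: total read count per sample (one value per column).
    library_size = counts.sum(axis=0)
    # Broadcasting divides every row by the per-column totals.
    return counts / library_size * 1e6


# Rows are genes, columns are samples.
raw = np.array([[10, 0], [30, 50], [60, 50]])
cpm_values = counts_per_million(raw)
```

A sample with a library size of zero would produce a division by zero in this sketch, so such columns need to be filtered beforehand.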

classmethod create_DGEList(sample_list: List[str], data_set: Dict[collections.abc.Hashable, Any], gene_list: List[str], sample_to_category: Optional[List[str]] = None, category_to_samples: Optional[Dict[collections.abc.Hashable, List[str]]] = None) → edgePy.DGEList.DGEList[source]

The sample list and gene list must be pre-sorted. Use this to create the DGE object for future work.

classmethod create_DGEList_data_file(data_file: pathlib.Path, group_file: pathlib.Path, **kwargs) → edgePy.DGEList.DGEList[source]

Wrapper for creating DGEList objects from file locations. Performs open and passes the file handles to the method for creating a DGEList object.

This function uses smart_open, which provides a broad list of data sources that can be opened. For a full list of data sources, see smart_open’s documentation at https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst

Parameters:
  • data_file – Text file defining the data set.
  • group_file – The JSON file defining the groups.
  • kwargs – Additional arguments supported by np.genfromtxt.
Returns:

Container for storing read counts for samples.

Return type:

DGEList

classmethod create_DGEList_handle(data_handle: _io.StringIO, group_handle: _io.StringIO, **kwargs) → edgePy.DGEList.DGEList[source]

Read in a file-like object of delimited data for instantiation.

Parameters:
  • data_handle – Text file defining the data set.
  • group_handle – The JSON file defining the groups.
  • kwargs – Additional arguments supported by np.genfromtxt.
Returns:Container for storing read counts for samples.
Return type:DGEList
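
The sketch below shows how file-like handles of this kind can be parsed with np.genfromtxt and json.load. It does not call edgePy; the tab-delimited layout (a header row of sample names, a first column of gene names) and the group-to-samples JSON layout are assumptions made for the example.

```python
import io
import json

import numpy as np

# Hypothetical input: header row of sample names, first column of gene names.
data_handle = io.StringIO(
    "gene\tsample1\tsample2\tsample3\tsample4\n"
    "geneA\t10\t12\t3\t5\n"
    "geneB\t0\t1\t40\t38\n"
)
# Hypothetical group file: group name -> list of sample names.
group_handle = io.StringIO(
    '{"group1": ["sample1", "sample2"], "group2": ["sample3", "sample4"]}'
)

# Read everything as strings first, then split off the labels.
table = np.genfromtxt(data_handle, delimiter="\t", dtype=str)
samples = table[0, 1:]
genes = table[1:, 0]
counts = table[1:, 1:].astype(int)
groups = json.load(group_handle)
```

Keyword arguments such as `delimiter` are the kind of option that kwargs forwards to np.genfromtxt.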
genes

Array of gene names.

get_gene_mask_and_lengths(gene_data)[source]

Use gene_data to get the gene lengths and a gene mask for the transformation.

Parameters:gene_data – the object that holds gene data from Ensembl.

library_size

The total read counts per sample.

Returns:The size of the library.
Return type:library_size
log_transform(counts, prior_count)[source]

Compute the log of the counts.

read_npz_file(filename: str) → None[source]

Import a file stored in the DGE export format.

Parameters:filename – the name of the file to read from.
rpkm(gene_data: edgePy.data_import.ensembl.ensembl_flat_file_reader.CanonicalDataStore, transform_to_log: bool = False, prior_count: float = 0.25) → edgePy.DGEList.DGEList[source]

Return the DGEList normalized to reads per kilobase of gene length per million reads: RPKM = numReads / (geneLength/1000 * totalNumReads/1,000,000).

Parameters:
  • gene_data – An object that works to import Ensembl based data, for use in calculations
  • transform_to_log – True if you wish to log-transform the values after converting to RPKM.
  • prior_count – a minimum value for genes, if you do log transforms.
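
The RPKM formula can be sketched with plain NumPy arrays. This is not the edgePy method: it takes gene lengths directly instead of a gene_data object, and the log handling (flooring values at prior_count, then taking the natural log) is an assumption about how a minimum value might be applied.

```python
import numpy as np


def rpkm_sketch(
    counts: np.ndarray,
    gene_lengths: np.ndarray,
    transform_to_log: bool = False,
    prior_count: float = 0.25,
) -> np.ndarray:
    """RPKM = numReads / (geneLength/1000 * totalNumReads/1,000,000)."""
    counts = np.asarray(counts, dtype=float)
    gene_lengths = np.asarray(gene_lengths, dtype=float)
    # Total reads per sample, expressed in millions (one value per column).
    millions_of_reads = counts.sum(axis=0) / 1e6
    # Gene length in kilobases, shaped as a column so it broadcasts over samples.
    kilobases = gene_lengths[:, np.newaxis] / 1e3
    values = counts / (kilobases * millions_of_reads)
    if transform_to_log:
        # Assumption: floor at prior_count to avoid log(0), then natural log.
        values = np.log(np.maximum(values, prior_count))
    return values


raw = np.array([[100, 200], [900, 800]])
lengths = np.array([1000, 2000])
rpkm_values = rpkm_sketch(raw, lengths)
```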
samples

Array of sample names.

tpm(gene_lengths: numpy.ndarray, transform_to_log: bool = False, prior_count: float = 0.25, mean_fragment_lengths: numpy.ndarray = None) → edgePy.DGEList.DGEList[source]

Normalize the DGEList to transcripts per million.

Adapted from Wagner, et al. ‘Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples.’ doi:10.1007/s12064-012-0162-3

Read counts \(X_i\) (for each gene \(i\) with gene length \(\widetilde{l_i}\)) are normalized as follows:

\[TPM_i = \frac{X_i}{\widetilde{l_i}} \cdot \left(\frac{1}{\sum_j \frac{X_j}{\widetilde{l_j}}}\right) \cdot 10^6\]
Parameters:
  • gene_lengths – 1D array of gene lengths for each gene in the rows of DGEList.counts.
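  • transform_to_log – True if you wish to log-transform the values after converting to TPM.
  • prior_count – a minimum value for genes, if you do log transforms.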
  • mean_fragment_lengths – 1D array of mean fragment lengths for each sample in the columns of DGEList.counts (optional)
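
The TPM formula above can be written out directly in NumPy. The function below is a sketch with a hypothetical name; it uses the supplied gene lengths as \(\widetilde{l_i}\) and leaves out both mean_fragment_lengths and log transformation, so it may differ from what the edgePy method does with those options.

```python
import numpy as np


def tpm_sketch(counts: np.ndarray, gene_lengths: np.ndarray) -> np.ndarray:
    """TPM_i = (X_i / l_i) / sum_j (X_j / l_j) * 10^6, computed per sample."""
    counts = np.asarray(counts, dtype=float)
    gene_lengths = np.asarray(gene_lengths, dtype=float)
    # Length-normalized rate X_i / l_i for every gene (row) and sample (column).
    rate = counts / gene_lengths[:, np.newaxis]
    # Divide by the per-sample sum of rates and scale to one million.
    return rate / rate.sum(axis=0) * 1e6


raw = np.array([[10, 20], [30, 20]])
lengths = np.array([1000, 3000])
tpm_values = tpm_sketch(raw, lengths)
```

Unlike RPKM, every column of the result sums to 10^6, which is the property the cited paper argues makes the values comparable among samples.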
write_npz_file(filename: str) → None[source]

Convert the object to a byte representation, which can be stored or imported.
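
For readers unfamiliar with the NPZ format, the sketch below round-trips a few arrays through NumPy's own np.savez_compressed and np.load. It does not use edgePy, and the array names stored in the archive (counts, samples, genes) are hypothetical; the keys written by write_npz_file may differ.

```python
import io

import numpy as np

counts = np.array([[10, 0], [30, 50]])
samples = np.array(["sample1", "sample2"])
genes = np.array(["geneA", "geneB"])

# Write the arrays into an in-memory NPZ archive (a zip of .npy files).
buffer = io.BytesIO()
np.savez_compressed(buffer, counts=counts, samples=samples, genes=genes)

# Rewind and read the archive back into a plain dictionary.
buffer.seek(0)
with np.load(buffer) as archive:
    restored = {name: archive[name] for name in archive.files}
```

Passing a file path instead of the BytesIO buffer writes the same archive to disk.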

Module contents