edgePy package¶
Subpackages¶
Submodules¶
edgePy.DGEList module¶
-
class
edgePy.DGEList.
DGEList
(counts: Optional[numpy.ndarray] = None, samples: Optional[numpy.core.multiarray.array] = None, genes: Optional[numpy.core.multiarray.array] = None, norm_factors: Optional[numpy.core.multiarray.array] = None, groups_in_list: Optional[numpy.core.multiarray.array] = None, groups_in_dict: Optional[Dict] = None, to_remove_zeroes: Optional[bool] = False, filename: Optional[str] = None, current_transform_type: Optional[str] = None, current_log_status: Optional[bool] = False)[source]¶ Bases:
object
Class containing read counts over genes for multiple samples and their corresponding metadata.
Parameters: - counts – Columns correspond to samples and rows to genes.
- samples – Array of sample names, same length as ncol(counts).
- genes – Array of gene names, same length as nrow(counts).
- norm_factors – Weighting factors for each sample.
- groups_in_list – A list of the groups to which each sample belongs, in the same order as samples. Alternative to groups_in_dict.
- groups_in_dict – A dictionary of groups, each mapping to a list of sample names. Alternative to groups_in_list.
- to_remove_zeroes – If True, remove genes with zero counts in all samples.
- filename – a shortcut to import NPZ (zipped numpy format) files.
- current_transform_type – None means raw counts; otherwise, if the counts have been transformed, a string naming the transform (e.g. ‘cpm’, ‘rpkm’).
- current_log_status – True if counts has already been log transformed; defaults to False.
Examples
>>> from edgePy.DGEList import DGEList
>>> from edgePy.data_import import get_dataset_path
>>> dataset = 'GSE49712_HTSeq.txt.gz'
>>> group_file = 'groups.json'
>>> DGEList.create_DGEList_data_file(get_dataset_path(dataset), get_dataset_path(group_file))
DGEList(num_samples=10, num_genes=21,711)
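The effect of the to_remove_zeroes option can be illustrated without the library. The following is a minimal sketch in plain Python, assuming counts is a list of rows (one per gene) and genes is a parallel list of names; remove_zero_genes is a hypothetical helper for illustration, not edgePy's implementation.

```python
from typing import List, Sequence, Tuple


def remove_zero_genes(
    counts: Sequence[Sequence[int]], genes: Sequence[str]
) -> Tuple[List[List[int]], List[str]]:
    """Drop genes (rows) whose count is zero in every sample (hypothetical helper)."""
    kept_counts: List[List[int]] = []
    kept_genes: List[str] = []
    for row, gene in zip(counts, genes):
        # Keep the gene if at least one sample has a non-zero count.
        if any(value != 0 for value in row):
            kept_counts.append(list(row))
            kept_genes.append(gene)
    return kept_counts, kept_genes


# Rows are genes, columns are samples.
example_counts = [[5, 0, 2], [0, 0, 0], [0, 1, 0]]
example_genes = ["geneA", "geneB", "geneC"]
filtered_counts, filtered_genes = remove_zero_genes(example_counts, example_genes)
```

Here geneB has no reads in any sample, so only geneA and geneC remain after filtering.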
-
static
_format_fields
(fields: Iterable[Union[str, bytes]]) → Generator[[str, None], None][source]¶ Clean fields in the header of any read data.
Yields: The next field that has been cleaned.
-
static
_sample_group_dict
(groups_list: List[str], samples: numpy.core.multiarray.array)[source]¶ Converts data in the form [‘group1’, ‘group1’, ‘group2’, ‘group2’] to the form {‘group1’: [‘sample1’, ‘sample2’], ‘group2’: [‘sample3’, ‘sample4’]}.
Parameters: - groups_list – group names in a list, in the same order as samples.
- samples – array of sample names.
Returns: dictionary containing the sample types, each with a list of samples.
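The list-to-dictionary conversion described above can be sketched in a few lines. This is an illustrative stand-alone version using plain Python lists; the function name sample_group_dict and the length check are assumptions, and the code is not taken from edgePy.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def sample_group_dict(
    groups_list: Sequence[str], samples: Sequence[str]
) -> Dict[str, List[str]]:
    """Group sample names by group label (illustrative, not edgePy's code)."""
    if len(groups_list) != len(samples):
        raise ValueError("groups_list and samples must have the same length")
    grouped: Dict[str, List[str]] = defaultdict(list)
    # Pair each sample with the group label at the same position.
    for group, sample in zip(groups_list, samples):
        grouped[group].append(sample)
    return dict(grouped)


groups = ["group1", "group1", "group2", "group2"]
sample_names = ["sample1", "sample2", "sample3", "sample4"]
grouped_samples = sample_group_dict(groups, sample_names)
```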
-
static
_sample_group_list
(groups_dict, samples)[source]¶ Converts data in the form {‘group1’: [‘sample1’, ‘sample2’], ‘group2’: [‘sample3’, ‘sample4’]} to the form [‘group1’, ‘group1’, ‘group2’, ‘group2’].
Parameters: - groups_dict – dictionary containing the sample types, each with a list of samples.
- samples – order of samples in the DGEList
Returns: data in a list, in the same order as samples.
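The reverse conversion, from a dictionary of groups to a list ordered like samples, might look like the sketch below. It assumes every sample appears in exactly one group; sample_group_list here is a stand-alone illustration, not the library's code.

```python
from typing import Dict, List, Sequence


def sample_group_list(
    groups_dict: Dict[str, Sequence[str]], samples: Sequence[str]
) -> List[str]:
    """Return the group of each sample, in the order given by samples (illustrative)."""
    # Invert the mapping: sample name -> group name.
    sample_to_group = {
        sample: group
        for group, members in groups_dict.items()
        for sample in members
    }
    # A sample missing from every group raises KeyError here.
    return [sample_to_group[sample] for sample in samples]


group_members = {"group1": ["sample1", "sample2"], "group2": ["sample3", "sample4"]}
sample_order = ["sample3", "sample1", "sample4", "sample2"]
group_labels = sample_group_list(group_members, sample_order)
```

The output follows the order of samples rather than the order of the dictionary, which is why the samples argument is needed.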
-
copy
(counts: Optional[numpy.ndarray] = None, samples: Optional[numpy.core.multiarray.array] = None, genes: Optional[numpy.core.multiarray.array] = None, norm_factors: Optional[numpy.core.multiarray.array] = None, groups_in_list: Optional[numpy.core.multiarray.array] = None, groups_in_dict: Optional[Dict] = None, to_remove_zeroes: Optional[bool] = False, current_type: Optional[str] = None, current_log: Optional[bool] = False) → edgePy.DGEList.DGEList[source]¶
-
counts
¶ The read counts for the genes in all samples.
Returns: Columns correspond to samples and rows to genes. Return type: counts
-
cpm
(transform_to_log: bool = False, prior_count: float = 0.25) → edgePy.DGEList.DGEList[source]¶ Normalize the DGEList to read counts per million.
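As a rough illustration of counts-per-million normalization, the sketch below scales each count by its sample's library size (the column total). It uses plain Python lists and a hypothetical function name; how edgePy applies transform_to_log and prior_count is not shown in this section, so the log step is left out.

```python
from typing import List, Sequence


def counts_per_million(counts: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale raw counts so that each sample (column) sums to one million (sketch)."""
    n_samples = len(counts[0])
    # Library size: total read count per sample.
    library_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [row[j] * 1_000_000 / library_sizes[j] for j in range(n_samples)]
        for row in counts
    ]


# Rows are genes, columns are samples.
raw_counts = [[10, 0], [30, 50], [60, 50]]
cpm_values = counts_per_million(raw_counts)
```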
-
classmethod
create_DGEList
(sample_list: List[str], data_set: Dict[collections.abc.Hashable, Any], gene_list: List[str], sample_to_category: Optional[List[str]] = None, category_to_samples: Optional[Dict[collections.abc.Hashable, List[str]]] = None) → edgePy.DGEList.DGEList[source]¶ The sample list and gene list must be pre-sorted. Use this to create the DGE object for future work.
-
classmethod
create_DGEList_data_file
(data_file: pathlib.Path, group_file: pathlib.Path, **kwargs) → edgePy.DGEList.DGEList[source]¶ Wrapper for creating DGEList objects from file locations. Performs open and passes the file handles to the method for creating a DGEList object.
This function uses smart_open, which provides a broad list of data sources that can be opened. For a full list of data sources, see smart_open’s documentation at https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst
Parameters: - data_file – Text file defining the data set.
- group_file – The JSON file defining the groups.
- kwargs – Additional arguments supported by np.genfromtxt.
Returns: Container for storing read counts for samples.
Return type: DGEList
-
classmethod
create_DGEList_handle
(data_handle: _io.StringIO, group_handle: _io.StringIO, **kwargs) → edgePy.DGEList.DGEList[source]¶ Read in a file-like object of delimited data for instantiation.
Parameters: - data_handle – Text file defining the data set.
- group_handle – The JSON file defining the groups.
- kwargs – Additional arguments supported by np.genfromtxt.
Returns: Container for storing read counts for samples. Return type: DGEList
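To make the two inputs concrete, here is a sketch of reading a delimited count table and a JSON group definition from file-like objects. The layout is an assumption made for illustration: a tab-delimited table whose header row holds sample names and whose first column holds gene names, and a JSON object mapping each group to its samples. The standard-library csv module stands in for np.genfromtxt, so this is not how edgePy parses the data.

```python
import csv
import io
import json

# Hypothetical inputs; real files may use a different layout.
data_handle = io.StringIO(
    "gene\tA_1\tA_2\tB_1\n"
    "GENE1\t5\t7\t0\n"
    "GENE2\t0\t3\t9\n"
)
group_handle = io.StringIO('{"A": ["A_1", "A_2"], "B": ["B_1"]}')

reader = csv.reader(data_handle, delimiter="\t")
header = next(reader)
parsed_samples = header[1:]  # the first header cell labels the gene column

parsed_genes = []
parsed_counts = []
for row in reader:
    parsed_genes.append(row[0])
    parsed_counts.append([int(value) for value in row[1:]])

# Maps each group name to the list of samples in that group.
parsed_groups = json.load(group_handle)
```

The resulting sample names, gene names, count rows and group dictionary correspond to the samples, genes, counts and groups_in_dict arguments of the constructor.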
-
genes
¶ Array of gene names.
-
get_gene_mask_and_lengths
(gene_data)[source]¶ Use gene_data to get the gene lengths and a gene mask for the transformation.
Parameters: gene_data – the object that holds gene data from Ensembl.
-
library_size
¶ The total read counts per sample.
Returns: The size of the library. Return type: library_size
-
read_npz_file
(filename: str) → None[source]¶ Import a file stored in the DGE export format.
Parameters: filename – the name of the file to read from.
-
rpkm
(gene_data: edgePy.data_import.ensembl.ensembl_flat_file_reader.CanonicalDataStore, transform_to_log: bool = False, prior_count: float = 0.25) → edgePy.DGEList.DGEList[source]¶ Return the DGEList normalized to reads per kilobase of gene length per million reads: RPKM = numReads / (geneLength/1000 * totalNumReads/1,000,000).
Parameters: - gene_data – An object holding imported Ensembl-based data, for use in the calculations.
- transform_to_log – If True, convert to log after converting to RPKM.
- prior_count – A minimum value for genes, used when log transforming.
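The RPKM formula above can be checked on a small example. This sketch takes gene lengths as a plain list rather than a CanonicalDataStore, skips the log transform, and uses the hypothetical name rpkm_sketch; it only illustrates the arithmetic and is not the library's implementation.

```python
from typing import List, Sequence


def rpkm_sketch(
    counts: Sequence[Sequence[float]], gene_lengths: Sequence[float]
) -> List[List[float]]:
    """RPKM = numReads / (geneLength/1000 * totalNumReads/1,000,000), per sample."""
    n_samples = len(counts[0])
    # Total reads per sample (column sums).
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [
            row[j] / ((length / 1000) * (totals[j] / 1_000_000))
            for j in range(n_samples)
        ]
        for row, length in zip(counts, gene_lengths)
    ]


# Rows are genes, columns are samples; lengths are in base pairs.
read_counts = [[200, 100], [300, 400], [500, 500]]
lengths_bp = [2000, 1000, 500]
rpkm_values = rpkm_sketch(read_counts, lengths_bp)
```

Each sample here has 1,000 reads in total, so the first gene (200 reads over 2 kb) comes out at roughly 100,000 RPKM in the first sample.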
-
samples
¶ Array of sample names.
-
tpm
(gene_lengths: numpy.ndarray, transform_to_log: bool = False, prior_count: float = 0.25, mean_fragment_lengths: numpy.ndarray = None) → edgePy.DGEList.DGEList[source]¶ Normalize the DGEList to transcripts per million.
Adapted from Wagner, et al. ‘Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples.’ doi:10.1007/s12064-012-0162-3
Read counts \(X_i\) (for each gene \(i\) with gene length \(\widetilde{l_i}\)) are normalized as follows:
\[TPM_i = \frac{X_i}{\widetilde{l_i}} \cdot \left(\frac{1}{\sum_j \frac{X_j}{\widetilde{l_j}}}\right) \cdot 10^6\]
Parameters: - gene_lengths – 1D array of gene lengths for each gene in the rows of DGEList.counts.
- transform_to_log – If True, store log-transformed outputs.
- prior_count – a minimum value for genes, if you do log transforms.
- mean_fragment_lengths – 1D array of mean fragment lengths for each sample in the columns of DGEList.counts (optional)
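The TPM formula can be sketched directly from its definition: divide each count by its gene length, then rescale every sample so the length-normalized rates sum to one million. The version below uses plain Python lists and the hypothetical name tpm_sketch; it ignores mean_fragment_lengths and the log transform, whose handling in edgePy is not described here.

```python
from typing import List, Sequence


def tpm_sketch(
    counts: Sequence[Sequence[float]], gene_lengths: Sequence[float]
) -> List[List[float]]:
    """TPM_i = (X_i / l_i) / sum_j(X_j / l_j) * 10^6, computed per sample."""
    n_samples = len(counts[0])
    # Length-normalized read rates X_i / l_i for every gene and sample.
    rates = [
        [row[j] / length for j in range(n_samples)]
        for row, length in zip(counts, gene_lengths)
    ]
    # Denominator: the sum of rates over all genes, per sample.
    rate_sums = [sum(rate_row[j] for rate_row in rates) for j in range(n_samples)]
    return [
        [rate_row[j] / rate_sums[j] * 1_000_000 for j in range(n_samples)]
        for rate_row in rates
    ]


# Rows are genes, columns are samples; lengths are in base pairs.
tpm_counts = [[200, 100], [300, 400], [500, 500]]
tpm_lengths = [2000, 1000, 500]
tpm_values = tpm_sketch(tpm_counts, tpm_lengths)
```

Unlike RPKM, the values in each sample column sum to one million by construction, which is the property the cited paper argues makes TPM comparable across samples.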