Overview¶
dmbiolib is a library of functions used in various bioinformatics projects.
- Source code:
- Python package:
- Bioconda package:
- Bug report / feature requests:
Installation¶
dmbiolib can be installed using pip:
pip install dmbiolib
Note that dependencies might need to be installed individually.
Usage¶
dmbiolib needs to be imported before its functions can be used. Example:
import dmbiolib as dbl
print(dbl.transl('atgcgattcacg'))
Latest news¶
Functions¶
aa_dist(seqs,parvrs,fname,r)¶
seqs: dictionary of dictionaries {vr1:{seq1:n1, seq2:n2, …}, vr2:{…}, …} where vr1, vr2 are variable region names, seq1, seq2 are amino acid sequences, n1, n2 are read counts
parvrs: dictionary of tuples {vr1:(seq1, pos1), vr2:(seq2, pos2), …} where vr1, vr2 are variable region names (must be the same as in seqs), seq1, seq2 are parental amino acid sequences, pos1, pos2 are position numbers in the parental protein chain (can be None if there is no parental sequence)
fname: name of output file (can be None if no need to save results)
r: handle of report file (can be None if not used)
aln2seq(filename,type,full,reference)¶
filename: file containing multiple sequence alignment in caplib3 format
type: dna or aa
full: True if full sequences are to be returned (only valid if reference is provided)
reference: name of file containing the reference sequence
check_file(filename,strict)¶
filename: file name to be tested
strict: True or False
check_plot_format(x)¶
x: string to be tested
check_read_file(filename)¶
filename: name of the file to be checked for sequencing reads
check_seq(sequence,type,required)¶
sequence: amino acid or nucleotide sequence
type: dna (‘atgc’), ambiguous (‘ryswkmbdhvn’), aa (‘ARNDCQEGHILKMFPSTWYV’), or any string (case insensitive)
required: string of characters (or type name as above), at least one of which must exist in the sequence
Examples:
import dmbiolib as dbl
print(dbl.check_seq('cgttcgaac',dbl.dna,dbl.dna))
True, True
print(dbl.check_seq('cgttnnaac',dbl.dna,dbl.dna))
False, True
print(dbl.check_seq('cgttnnaac',dbl.dna,dbl.ambiguous))
True, True
check_sync(read1,read2)¶
read1, read2: nucleotide sequences
complexity(sequence)¶
sequence: nucleotide sequence (including ambiguous nucleotides) to be translated (in frame)
Example:
import dmbiolib as dbl
x=dbl.complexity('atgdbctss')
for n in x:
print(n)
defaultdict(<class 'int'>, {'M': 1})
defaultdict(<class 'int'>, {'F': 1, 'C': 1, 'S': 2, 'V': 1, 'G': 1, 'A': 1, 'I': 1, 'T': 1})
defaultdict(<class 'int'>, {'W': 1, 'C': 1, 'S': 2})
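The output above suggests that each codon is expanded into every unambiguous codon it can represent, and that the resulting amino acids are counted. A minimal sketch of that idea follows. It is not dmbiolib's source: the name complexity_sketch and the IUPAC and CODON_TABLE dictionaries are introduced here for illustration, and the standard genetic code is assumed.

```python
from collections import defaultdict
from itertools import product

# IUPAC nucleotide codes mapped to the bases they stand for
IUPAC = {
    'a': 'a', 'c': 'c', 'g': 'g', 't': 't',
    'r': 'ag', 'y': 'ct', 's': 'cg', 'w': 'at', 'k': 'gt', 'm': 'ac',
    'b': 'cgt', 'd': 'agt', 'h': 'act', 'v': 'acg', 'n': 'acgt',
}

# Standard genetic code, with codons enumerated in t, c, a, g order
BASES = 'tcag'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {
    ''.join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def complexity_sketch(sequence):
    """Yield, for each in-frame codon, a count of the amino acids it can encode.

    Hypothetical helper, not the library function itself.
    """
    sequence = sequence.lower()
    for i in range(0, len(sequence) - 2, 3):
        counts = defaultdict(int)
        codon = sequence[i:i + 3]
        # Expand the ambiguous codon into every concrete codon it covers
        for bases in product(*(IUPAC[nt] for nt in codon)):
            counts[CODON_TABLE[''.join(bases)]] += 1
        yield counts


for counts in complexity_sketch('atgdbctss'):
    print(dict(counts))
```

Multiplying the number of distinct amino acids per position gives an upper bound on the number of protein variants a library sequence can encode.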
compress(sequence)¶
sequence: nucleotide sequence
Example:
import dmbiolib as dbl
print(dbl.compress('gggcaatccccnnnncaagtt'))
gcatcnnnncagt
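Judging from the example, runs of identical nucleotides are collapsed to a single character while runs of n are left untouched. The sketch below reproduces that behavior with itertools.groupby. The name compress_sketch is hypothetical, and the special handling of n is an assumption inferred from the example output only.

```python
from itertools import groupby


def compress_sketch(sequence):
    """Collapse homopolymer runs to one character, keeping runs of 'n' intact.

    Hypothetical helper; the treatment of 'n' is inferred from the example.
    """
    parts = []
    for nt, run in groupby(sequence):
        run_length = len(list(run))
        # Keep ambiguous 'n' stretches at full length, shrink everything else
        parts.append(nt * run_length if nt == 'n' else nt)
    return ''.join(parts)


print(compress_sketch('gggcaatccccnnnncaagtt'))
```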
conf_start(title)¶
conf_end(filename,content,title)¶
csv_read(filename,dic,header)¶
filename: name of csv file to be read
dic (True/False): whether to store the contents of the csv file in a dictionary (True) or a list (False).
header (True/False): whether the file starts with a header or not (or directly with the data)
csv_write(filename,keys,list_or_dic,header,description,file_handle)¶
filename: name of csv file to be created
keys: optional first column (if not already part of the list or dictionary)
list_or_dic: list (or tuple) or dictionary containing the data (which can be strings, lists, tuples or dictionaries) to be written into the csv file
header: optional top row to be written before the main data
description: file description to be used in the message confirming completion of csv file
file_handle: handle of the report file (or None if there is no report file)
detect_vr(libnt,mindist)¶
libnt: in frame, protein-coding library nucleotide sequence (containing ambiguous positions)
mindist: minimum distance between 2 variable regions
diff(sequences)¶
sequences: list of sequences
Examples:
import dmbiolib as dbl
print(dbl.diff(['agct','gatc','ctga','tcag']))
4
print(dbl.diff(['agct','gatc','ctga','aata']))
2
dirname()¶
Example, if current directory is /home/someuser/somedir:
import dmbiolib as dbl
print(dbl.dirname())
somedir
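The same result can be obtained with the standard library alone, which may help clarify what is returned. This is a sketch under the assumption that only the last component of the current working directory is wanted. The name dirname_sketch is hypothetical.

```python
import os


def dirname_sketch():
    """Return the name of the current working directory without its parent path."""
    return os.path.basename(os.getcwd())


print(dirname_sketch())
```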
entropy(matrix)¶
matrix: list of lists of values
exprange(a,b,c)¶
a,b: range boundaries
c: multiplying factor
Example:
import dmbiolib as dbl
x=dbl.exprange(1,100,3)
for n in x:
print(n)
1
3
9
27
81
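The example output corresponds to a geometric progression that starts at a and is multiplied by c until it reaches b. A minimal generator with the same output is sketched below. The name exprange_sketch is hypothetical, and treating b as an exclusive upper bound (as in the built-in range) is an assumption, since the example does not settle that point.

```python
def exprange_sketch(a, b, c):
    """Yield a, a*c, a*c*c, ... for as long as the value stays below b.

    Hypothetical helper; b is assumed to be exclusive.
    """
    if a <= 0 or c <= 1:
        # Without this guard the loop below would never end
        raise ValueError('this sketch needs a > 0 and c > 1 to terminate')
    value = a
    while value < b:
        yield value
        value *= c


for n in exprange_sketch(1, 100, 3):
    print(n)
```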
find_ambiguous(seq)¶
seq: nucleotide sequence (containing ambiguous nucleotides)
Example:
import dmbiolib as dbl
seq='gatcgatcgtnnnnngactgavvmttcgsbynccgtcga'
print(dbl.find_ambiguous(seq))
{10: 5, 21: 3, 28: 4}
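In the example, each key is the 0-based start index of a stretch of ambiguous nucleotides and each value is the length of that stretch. A regular expression can produce the same mapping, as in the sketch below. The name find_ambiguous_sketch is hypothetical and the code is not taken from dmbiolib.

```python
import re

# One or more consecutive IUPAC ambiguity codes
AMBIGUOUS_RUN = re.compile(r'[ryswkmbdhvn]+')


def find_ambiguous_sketch(seq):
    """Map the start index of each ambiguous stretch to its length."""
    return {
        m.start(): len(m.group())
        for m in AMBIGUOUS_RUN.finditer(seq.lower())
    }


seq = 'gatcgatcgtnnnnngactgavvmttcgsbynccgtcga'
print(find_ambiguous_sketch(seq))
```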
find_read_files()¶
findall(probe,seq,start,end,overlap=False)¶
probe: string, occurrences of which are searched in seq
seq: string in which probe is searched
start: seq start index of search (0 if no limit)
end: seq end index of search (None if no limit)
overlap: optional argument to allow overlaps (default: False)
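The effect of the overlap argument is easiest to see in code. The sketch below uses str.find and assumes that the result is a list of 0-based start indices. That return type is an assumption, as is the name findall_sketch.

```python
def findall_sketch(probe, seq, start=0, end=None, overlap=False):
    """Return the start indices of probe in seq[start:end].

    Hypothetical helper; returning a list of indices is an assumption.
    """
    if not probe:
        raise ValueError('probe must not be empty')
    positions = []
    # Without overlaps, resume the search after the end of the previous hit
    step = 1 if overlap else len(probe)
    i = seq.find(probe, start, end)
    while i != -1:
        positions.append(i)
        i = seq.find(probe, i + step, end)
    return positions


print(findall_sketch('aa', 'aaaa'))                # non-overlapping hits
print(findall_sketch('aa', 'aaaa', overlap=True))  # overlapping hits
```

With overlap set to False, the second search starts after the whole previous match, so 'aa' is found twice in 'aaaa' rather than three times.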
format_dna(seq,margin,cpl,cpn)¶
seq: raw nucleotide sequence
margin: left margin
cpl: number of characters per line
cpn: number of characters per number
Example:
import dmbiolib as dbl
seq='gatcgatcgatcgatcgtacgtatcgatcgatcgatcgatcgactgatcagctacgatcgatcgatcgatgtgacccccttagc'
print(dbl.format_dna(seq,5,30,10))
             10        20        30
     gatcgatcgatcgatcgtacgtatcgatcg
             40        50        60
     atcgatcgatcgactgatcagctacgatcg
             70        80
     atcgatcgatgtgacccccttagc
frame(seq,strict=False)¶
seq: nucleotide sequences to be examined
strict: when True, will return None if the guess is too speculative (optional argument, default: False)
fsize(filename)¶
getfasta(fname,type,required,multi)¶
fname: name of the fasta file to be opened
type: dna or aa
required: same as type, or ‘ambiguous’ if some ambiguous nucleotides must be present
multi: whether the file contains multiple sequences (True) or a single one (False)
getread(f,y,counter)¶
f: file handle
y: number of lines per sequence (or 0 if variable number)
counter: number of reads already processed
initreadfile(rfile)¶
rfile: read file (can be fasta or fastq, uncompressed or gzipped)
intorfloat(x)¶
x: string to be tested for conversion into an integer or a float
match(seq1, seq2)¶
seq1, seq2: nucleotide sequences (with or without ambiguous nucleotides)
Examples:
import dmbiolib as dbl
dbl.match('acgatcg','accatcg')
False
dbl.match('acgatcg','acsancg')
True
mean(x)¶
x: list or tuple of numerical values
Example:
import dmbiolib as dbl
print(dbl.mean([12,30,24]))
22.0
mut_per_read(seqs,parseq,fname,r)¶
seqs: dictionary {seq1:n1, seq2:n2, …} where seq1, seq2: amino acid sequences, n1, n2: numbers of reads
parseq: parental sequence (must be same length as sequences in seqs)
fname: name of output file (can be None if no need to save results)
r: handle of report file (can be None if not used)
nt_match(nt1, nt2)¶
nt1, nt2: nucleotide (a, g, c, t or ambiguous)
Examples:
import dmbiolib as dbl
dbl.nt_match('a','a')
True
dbl.nt_match('a','g')
False
dbl.nt_match('n','a')
True
dbl.nt_match('s','n')
True
dbl.nt_match('r','y')
False
dbl.nt_match('g','s')
True
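All the examples above, as well as those given for match(seq1, seq2), are consistent with one rule: two nucleotide codes match when the sets of bases they stand for share at least one member. The sketch below implements that rule. It is an interpretation of the examples, not dmbiolib's source. The names nt_match_sketch, match_sketch and IUPAC are hypothetical, and requiring equal lengths in match_sketch is an assumption.

```python
# IUPAC nucleotide codes mapped to the set of bases they stand for
IUPAC = {
    'a': {'a'}, 'c': {'c'}, 'g': {'g'}, 't': {'t'},
    'r': {'a', 'g'}, 'y': {'c', 't'}, 's': {'c', 'g'}, 'w': {'a', 't'},
    'k': {'g', 't'}, 'm': {'a', 'c'},
    'b': {'c', 'g', 't'}, 'd': {'a', 'g', 't'},
    'h': {'a', 'c', 't'}, 'v': {'a', 'c', 'g'},
    'n': {'a', 'c', 'g', 't'},
}


def nt_match_sketch(nt1, nt2):
    """Return True if the two codes can represent at least one common base."""
    return bool(IUPAC[nt1.lower()] & IUPAC[nt2.lower()])


def match_sketch(seq1, seq2):
    """Return True if the sequences are compatible at every position.

    Assumes that sequences of different lengths never match.
    """
    return len(seq1) == len(seq2) and all(
        nt_match_sketch(a, b) for a, b in zip(seq1, seq2)
    )


print(nt_match_sketch('r', 'y'))
print(match_sketch('acgatcg', 'acsancg'))
```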
plot_end(fig,name,format,mppdf)¶
fig: figure handle
name: file name without extension (if each figure is saved individually)
format: extension corresponding to the chosen figure format (if each figure is saved individually)
mppdf: PdfPages handle (if all figures are saved in a single pdf file)
plot_start(x,y,z)¶
x: color map to be used
y: number of colors needed
z: plot title
pr2(f,text)¶
f: file handle
text: text to be printed
prefix(x)¶
x: list of file names
Example:
import dmbiolib as dbl
x=['P0-left_L4_2.fq.gz', 'P0-right_L4_2.fq.gz', 'P1-left_L4_2.fq.gz', 'P1-right_L4_2.fq.gz', 'P2-left_L4_2.fq.gz', 'P2-right_L4_2.fq.gz']
print(dbl.prefix(x))
['P0-left', 'P0-right', 'P1-left', 'P1-right', 'P2-left', 'P2-right']
prod(x)¶
x: list or tuple of numbers
progress_check(c,show,text)¶
c: read counter
show: dictionary of read numbers that trigger an update of the % value shown by the progress counter
text: text describing the process (should be the same as in progress_start(nr,text))
progress_end()¶
progress_start(nr,text)¶
nr: number of reads
text: text describing the process
readcount(R)¶
R: name of read file
fail: fail message
rename(filename)¶
filename: name of the file to be renamed
revcomp(seq)¶
seq: nucleotide sequence
Example:
import dmbiolib as dbl
print(dbl.revcomp('agctgctaa'))
ttagcagct
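Reverse complementation can be written compactly with a translation table. The sketch below is one common way to do it. The name revcomp_sketch is hypothetical, and the support for IUPAC ambiguity codes and the conversion to lower case are assumptions about what a caller might want, not a statement about the library.

```python
# Complement of each nucleotide code, including IUPAC ambiguity codes
COMPLEMENT = str.maketrans('acgtrykmbdhvswn', 'tgcayrmkvhdbswn')


def revcomp_sketch(seq):
    """Return the reverse complement of a nucleotide sequence, in lower case."""
    return seq.lower().translate(COMPLEMENT)[::-1]


print(revcomp_sketch('agctgctaa'))
```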
rfile_create(filename)¶
filename: name of the read file to be created
rfile_open(filename)¶
filename: name of the read file to be opened
seq_clust_card_dist(seqs,fname,r)¶
seqs: either a list [n1, n2, …] or a dictionary {seq1:n1, seq2:n2, …} where seq1, seq2: amino acid sequences, n1, n2: numbers of reads
fname: name of output file (can be None if no need to save results)
r: handle of report file (can be None if not used)
seq_write(fname,top,seqs,dic,descr,r)¶
fname: name of file to be created
top: string to be added to top of file
seqs: list of sequences (or None)
dic: dictionary of sequences with their read numbers {seq1:n1, seq2:n2, …} (or None)
descr: description to be included in message informing of task completion
r: handle of report file (can be None if not used)
shortest_probe(seqs,lim,host,t)¶
seqs: list of nucleotide sequences
lim: minimum probe size
host: host genome
t: description
size_dist(seqs,fname,r)¶
seqs: dictionary {seq1:n1, seq2:n2, …} where seq1, seq2: amino acid sequences, n1, n2: numbers of reads
fname: name of output file (can be None if no need to save results)
r: handle of report file (can be None if not used)
sortfiles(l,str)¶
l: list of file names to be sorted
str: string before which file names will be sorted
transl(seq)¶
seq: nucleotide sequence
Example:
import dmbiolib as dbl
print(dbl.transl('atgctgaaagcc'))
MLKA
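In-frame translation amounts to a codon table lookup, as the sketch below shows. It assumes the standard genetic code, writes stop codons as * and returns X for any codon that is not in the table. Those last two conventions are assumptions, as is the name transl_sketch.

```python
from itertools import product

# Standard genetic code, with codons enumerated in t, c, a, g order
BASES = 'tcag'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {
    ''.join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transl_sketch(seq):
    """Translate a nucleotide sequence codon by codon, starting at index 0.

    Trailing nucleotides that do not fill a codon are ignored.
    """
    seq = seq.lower()
    return ''.join(
        CODON_TABLE.get(seq[i:i + 3], 'X')  # 'X' marks codons not in the table
        for i in range(0, len(seq) - 2, 3)
    )


print(transl_sketch('atgctgaaagcc'))
```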
xcount(f,x)¶
f: file handle (file must be opened in binary mode)
x: string to be counted