Saturday, November 24, 2018

[minimap2] How to Download, Install and execute minimap2

How to Download, Install, Execute and Make a Reference file in minimap2?

QUESTION

How to Download minimap2?
How to Install minimap2?
How to Make a Reference index in minimap2?
How to Execute minimap2?

ANSWER

Download minimap2
git clone https://github.com/lh3/minimap2.git

Install minimap2
cd minimap2
make
Run minimap2 without arguments to confirm that it was built correctly. It should print the usage message.
./minimap2
Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]

Options:
Indexing:
-H use homopolymer-compressed k-mer (preferrable for PacBio)
-k INT k-mer size (no larger than 28) [15]
-w INT minizer window size [10]
-I NUM split index for every ~NUM input bases [4G]
-d FILE dump index to FILE []

Mapping:
-f FLOAT filter out top FLOAT fraction of repetitive minimizers [0.0002]
-g NUM stop chain enlongation if there are no minimizers in INT-bp [5000]
-G NUM max intron length (effective with -xsplice; changing -r) [200k]
-F NUM max fragment length (effective with -xsr or in the fragment mode) [800]
-r NUM bandwidth used in chaining and DP-based alignment [500]
-n INT minimal number of minimizers on a chain [3]
-m INT minimal chaining score (matching bases minus log gap penalty) [40]
-X skip self and dual mappings (for the all-vs-all mode)
-p FLOAT min secondary-to-primary score ratio [0.8]
-N INT retain at most INT secondary alignments [5]

Alignment:
-A INT matching score [2]
-B INT mismatch penalty [4]
-O INT[,INT] gap open penalty [4,24]
-E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]
-z INT[,INT] Z-drop score and inversion Z-drop score [400,200]
-s INT minimal peak DP alignment score [80]
-u CHAR how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]

Input/Output:
-a output in the SAM format (PAF by default)
-Q don't output base quality in SAM
-L write CIGAR with >65535 ops at the CG tag
-R STR SAM read group line in a format like '@RG\tID:foo\tSM:bar' []
-c output CIGAR in PAF
--cs[=STR] output the cs tag; STR is 'short' (if absent) or 'long' [none]
--MD output the MD tag
--eqx write =/X CIGAR operators
-Y use soft clipping for supplementary alignments
-t INT number of threads [3]
-K NUM minibatch size for mapping [500M]
--version show version number

Preset:
-x STR preset (always applied before other options; see minimap2.1 for details) []
- map-pb/map-ont: PacBio/Nanopore vs reference mapping
- ava-pb/ava-ont: PacBio/Nanopore read overlap
- asm5/asm10/asm20: asm-to-ref mapping, for ~0.1/1/5% sequence divergence
- splice: long-read spliced alignment
- sr: genomic short-read mapping

See `man ./minimap2.1' for detailed description of these and other advanced command-line options.

Make a Reference index in minimap2
minimap2 -d ucsc.hg19.mmi ucsc.hg19.fasta
For me, indexing the 3 GB ucsc.hg19.fasta took about 3 minutes of CPU time (under 2 minutes of wall-clock time, as the log below shows).
You should see a log like the following.
minimap2 -d ucsc.hg19.mmi ucsc.hg19.fasta
[M::mm_idx_gen::75.576*1.76] collected minimizers
[M::mm_idx_gen::90.135*1.96] sorted minimizers
[M::main::102.770*1.83] loaded/built the index for 93 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 93
[M::mm_idx_stat::103.674*1.82] distinct minimizers: 100029963 (38.72% are singletons); average occurrences: 5.458; average spacing: 5.746
[M::main] Version: 2.14-r886-dirty
[M::main] CMD: minimap2 -d ucsc.hg19.mmi ucsc.hg19.fasta
[M::main] Real time: 103.849 sec; CPU: 189.092 sec; Peak RSS: 11.213 GB


Execute minimap2
To map paired-end HiSeq short reads from FASTQ files, use the short-read preset (-ax sr). The available presets are listed first, followed by the short-read command.
Presets
# use presets (no test data)

./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam # PacBio genomic reads
./minimap2 -ax map-ont ref.fa ont.fq.gz > aln.sam # Oxford Nanopore genomic reads
./minimap2 -ax asm20 ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio CCS genomic reads
./minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam # short genomic paired-end reads
./minimap2 -ax splice ref.fa rna-reads.fa > aln.sam # spliced long reads (strand unknown)
./minimap2 -ax splice -uf -k14 ref.fa reads.fa > aln.sam # noisy Nanopore Direct RNA-seq
./minimap2 -ax splice -uf -C5 ref.fa query.fa > aln.sam # Final PacBio Iso-seq or traditional cDNA
./minimap2 -cx asm5 asm1.fa asm2.fa > aln.paf # intra-species asm-to-asm alignment
./minimap2 -x ava-pb reads.fa reads.fa > overlaps.paf # PacBio read overlap
./minimap2 -x ava-ont reads.fa reads.fa > overlaps.paf # Nanopore read overlap
minimap2 -ax sr \
-t <Thread> \
-R <ReadGroup> \
ucsc.hg19.fasta \
sample_1.fastq.gz \
sample_2.fastq.gz
Compared with BWA 0.7.12, mapping time for HiSeq short reads was about 20% shorter.
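
If the mapping step is driven from a script, the short-read command above can be assembled as an argument list. This is only a sketch: the helper name build_minimap2_command, the thread count of 4, and the read group string (copied from the usage message's own example) are illustrative assumptions, not values from the original run.

```python
import shlex


def build_minimap2_command(reference, read1, read2, threads, read_group):
    """Assemble the paired-end short-read command as an argument list.

    Hypothetical helper: only flags shown in the post (-ax sr, -t, -R) are used.
    """
    return [
        "minimap2", "-ax", "sr",
        "-t", str(threads),
        "-R", read_group,
        reference, read1, read2,
    ]


cmd = build_minimap2_command(
    "ucsc.hg19.fasta", "sample_1.fastq.gz", "sample_2.fastq.gz",
    threads=4,                            # assumed value for <Thread>
    read_group="@RG\\tID:foo\\tSM:bar",   # assumed value for <ReadGroup>
)

# Show the command as it would be typed in a shell (requires Python 3.8+)
print(shlex.join(cmd))
```

Passing cmd to subprocess.run with stdout pointed at an open file would write the SAM output to that file, since minimap2 prints alignments to standard output.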


Reference
https://github.com/lh3/minimap2

Saturday, November 17, 2018

[JAVA] How to Print Date in yyyy-mm-dd format in Java, Class Date

How to Print Date in yyyy-mm-dd format in Java, Class Date

QUESTION

How to Print Date in yyyy-mm-dd format in Java?
How to Print Time in hh:mm:ss format in Java?

ANSWER

One of the simplest ways to print a formatted date in Java is to use the Date class.
A Date object represents a specific instant in time.
It is just a container for the number of milliseconds since the UNIX epoch (January 1, 1970, 00:00:00.000 GMT).
Printing a Date object directly gives output like this:
Sun Nov 18 12:34:56 GMT 2018
CurrentTime.java
import java.util.Date;

public class CurrentTime {
 public static void main(String[] args) {
  Date today = new Date();
  System.out.println(today);
  // Sun Nov 18 12:34:56 GMT 2018
 }
}
CurrentTime2.java
import java.text.SimpleDateFormat;
import java.util.Date;

public class CurrentTime2 {
 public static void main(String[] args) {
  Date today = new Date();
  System.out.println(today);
  // Sun Nov 18 12:34:56 GMT 2018

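  // "MM" is the month; lowercase "mm" means minutes
  SimpleDateFormat date = new SimpleDateFormat("yyyy-MM-dd");
  // "hh" is the 12-hour clock and "a" is the AM/PM marker; use "HH" for 24-hour time
  SimpleDateFormat time = new SimpleDateFormat("hh:mm:ss a");

  System.out.println("Date: "+date.format(today));
  // Date: 2018-11-18
  System.out.println("Time: "+time.format(today));
  // Time: 12:34:56 PM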
 }
}


Thursday, November 15, 2018

[Python] How to sort a Python list of strings with numbers inside? - How to write natural sort order in Python?





Question:

How to sort a Python list of strings with numbers inside?
For example, [ "item1", "item13", "item15", "item7" ]

How to write natural sort order in Python?

Answer:

There are two basic ways to sort a list in Python.

First, the sorted() function. The built-in sorted() function returns a new sorted list. Be careful! The original list itself is not sorted. [Line6, Line7]


Second, the list.sort() method. The list.sort() method sorts the list itself in place.


When list items combine a string with a number, neither the sorted() function nor the list.sort() method gives the expected order.

Both line5 and line7 show a list that is sorted, but not in numeric order.
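
As a minimal sketch of the difference between the two, using the example list from this post (the variable names are only illustrative):

```python
items = ["item7", "item1", "item15", "item13"]

# sorted() returns a new list and leaves the original untouched
result = sorted(items)
print(result)  # ['item1', 'item13', 'item15', 'item7']
print(items)   # ['item7', 'item1', 'item15', 'item13']

# list.sort() sorts in place and returns None
returned = items.sort()
print(returned)  # None
print(items)     # ['item1', 'item13', 'item15', 'item7']
```

Both calls give the same lexicographic order, so 'item13' still comes before 'item7'.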

To fix this, call the list.sort() method with a lambda function as the key. [Line12]

The list.sort() method accepts a "key" argument, a function that specifies the sorting criterion.



In this example, each list element consists of "string" + "number".

["item7", "item1", "item15", "item13"]  [Line 4]

We want to sort these elements in numeric order, like this:

['item1', 'item7', 'item13', 'item15']  [Line 13]

With the built-in sort and no key, however, the result is not in numeric order.

['item1', 'item13', 'item15', 'item7']  [Line 5, 7]

This is because lexicographic sorting compares strings character by character, and the character '7' is greater than '1'.

That's why the built-in sort places 'item13' before 'item7'.
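
The character-by-character comparison can be checked directly. In this small sketch, first_difference is a hypothetical helper written only for illustration:

```python
def first_difference(a, b):
    """Return the index and the two characters where the strings first differ."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, ca, cb
    return None  # one string is a prefix of the other


print(first_difference("item13", "item7"))  # (4, '1', '7')
print("item13" < "item7")                   # True, because '1' < '7'
print(13 < 7)                               # False when compared as integers
```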


To solve this problem, we need to tell Python the sorting criterion by passing a parameter called "key".

The sort key should be the number part, not the whole string.

All elements in this list consist of the string "item" followed by a numeric value.

Slicing each element from index 4 onward gives the numeric part.

For example, element[4:]

In this way, list items of the form "string" + "number" can be sorted with lst.sort(key=lambda x: int(x[4:])). [Line 12]
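
Here is a runnable sketch of the key-based sort described above. The second half goes beyond the post: natural_key is a hypothetical helper that swaps the fixed slice for a regular-expression split, so it also handles prefixes of different lengths.

```python
import re

items = ["item7", "item1", "item15", "item13"]

# Sort by the integer after the 4-character prefix "item"
items.sort(key=lambda x: int(x[4:]))
print(items)  # ['item1', 'item7', 'item13', 'item15']


def natural_key(text):
    """Split text into digit and non-digit chunks; digit chunks become ints."""
    parts = re.split(r"(\d+)", text)
    # With a capturing group, the odd positions hold the digit runs
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


mixed = ["file10.txt", "file2.txt", "File1.txt"]
print(sorted(mixed, key=natural_key))  # ['File1.txt', 'file2.txt', 'file10.txt']
```

The slice-based key is the simplest choice when every element shares the same prefix. The split-based key is an assumption-free alternative only in the sense that it does not depend on the prefix length.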








Reference:
Python List.sort() method, Key Functions.

Why do some sorting methods sort by 1, 10, 2, 3 ... ?



[Bioinformatics] How to open a FASTA file and count its total bases in Windows?







Question:

How to open a FASTA file and count its total bases in Windows?


Answer:

There are two ways to open a FASTA file and count its total bases in Windows.


Solution1)

Read the FASTA file with a Python script.



https://github.com/KennethJHan/FastaReader/blob/master/Python/FastaReader.py
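
For readers who only need the idea, here is a minimal sketch of counting bases per record. It is not the code of the linked FastaReader.py; the function name count_bases and the sample records are assumptions made for illustration.

```python
def count_bases(lines):
    """Count sequence characters per record from an iterable of FASTA lines."""
    counts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first word after '>' as the record name
            name = line[1:].split()[0] if len(line) > 1 else ""
            counts[name] = 0
        elif name is not None:
            counts[name] += len(line)
    return counts


# Made-up records standing in for a real file
example = [">chr1 test", "ACGTACGT", "ACGT", ">chr2", "NNNNAC"]
per_record = count_bases(example)
print(per_record)                # {'chr1': 12, 'chr2': 6}
print(sum(per_record.values()))  # 18
```

Because an open file is also an iterable of lines, the same function can be given a file handle opened with open() on a FASTA file path.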





Solution2)

Download FastaReader GUI v1.0

1) Download FastaReaderGui.jar

https://github.com/KennethJHan/FastaReader/tree/master/JAVA_GUI/v1.0

2) Launch FastaReaderGui.jar.
Click the "Open a Fasta File" button.


3) Select and open a FASTA file.


4) See the result.