Saturday, November 24, 2018

[minimap2] How to Download, Install and execute minimap2

How to Download, Install, Execute and Make a Reference file in minimap2?

QUESTION

How to Download minimap2?
How to Install minimap2?
How to Make a Reference index in minimap2?
How to Execute minimap2?

ANSWER

Download minimap2
git clone https://github.com/lh3/minimap2.git

Install minimap2
cd minimap2
make
Let’s run minimap2 to confirm that it was installed properly.
./minimap2
Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]

Options:
Indexing:
-H use homopolymer-compressed k-mer (preferrable for PacBio)
-k INT k-mer size (no larger than 28) [15]
-w INT minizer window size [10]
-I NUM split index for every ~NUM input bases [4G]
-d FILE dump index to FILE []

Mapping:
-f FLOAT filter out top FLOAT fraction of repetitive minimizers [0.0002]
-g NUM stop chain enlongation if there are no minimizers in INT-bp [5000]
-G NUM max intron length (effective with -xsplice; changing -r) [200k]
-F NUM max fragment length (effective with -xsr or in the fragment mode) [800]
-r NUM bandwidth used in chaining and DP-based alignment [500]
-n INT minimal number of minimizers on a chain [3]
-m INT minimal chaining score (matching bases minus log gap penalty) [40]
-X skip self and dual mappings (for the all-vs-all mode)
-p FLOAT min secondary-to-primary score ratio [0.8]
-N INT retain at most INT secondary alignments [5]

Alignment:
-A INT matching score [2]
-B INT mismatch penalty [4]
-O INT[,INT] gap open penalty [4,24]
-E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]
-z INT[,INT] Z-drop score and inversion Z-drop score [400,200]
-s INT minimal peak DP alignment score [80]
-u CHAR how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]

Input/Output:
-a output in the SAM format (PAF by default)
-Q don't output base quality in SAM
-L write CIGAR with >65535 ops at the CG tag
-R STR SAM read group line in a format like '@RG\tID:foo\tSM:bar' []
-c output CIGAR in PAF
--cs[=STR] output the cs tag; STR is 'short' (if absent) or 'long' [none]
--MD output the MD tag
--eqx write =/X CIGAR operators
-Y use soft clipping for supplementary alignments
-t INT number of threads [3]
-K NUM minibatch size for mapping [500M]
--version show version number

Preset:
-x STR preset (always applied before other options; see minimap2.1 for details) []
- map-pb/map-ont: PacBio/Nanopore vs reference mapping
- ava-pb/ava-ont: PacBio/Nanopore read overlap
- asm5/asm10/asm20: asm-to-ref mapping, for ~0.1/1/5% sequence divergence
- splice: long-read spliced alignment
- sr: genomic short-read mapping

See `man ./minimap2.1' for detailed description of these and other advanced command-line options.

Make a Reference index in minimap2
minimap2 -d ucsc.hg19.mmi ucsc.hg19.fasta
This process took about 3 minutes for me (ucsc.hg19.fasta is about 3 GB).
You should see a log like the following.
minimap2 -d ucsc.hg19.mmi ucsc.hg19.fasta
[M::mm_idx_gen::75.576*1.76] collected minimizers
[M::mm_idx_gen::90.135*1.96] sorted minimizers
[M::main::102.770*1.83] loaded/built the index for 93 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 93
[M::mm_idx_stat::103.674*1.82] distinct minimizers: 100029963 (38.72% are singletons); average occurrences: 5.458; average spacing: 5.746
[M::main] Version: 2.14-r886-dirty
[M::main] CMD: minimap2 -d ucsc.hg19.mmi ucsc.hg19.fasta
[M::main] Real time: 103.849 sec; CPU: 189.092 sec; Peak RSS: 11.213 GB


Execute minimap2
Suppose we want to map HiSeq short reads from FASTQ files (use -ax sr).
Presets
# use presets (no test data)

./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam # PacBio genomic reads
./minimap2 -ax map-ont ref.fa ont.fq.gz > aln.sam # Oxford Nanopore genomic reads
./minimap2 -ax asm20 ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio CCS genomic reads
./minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam # short genomic paired-end reads
./minimap2 -ax splice ref.fa rna-reads.fa > aln.sam # spliced long reads (strand unknown)
./minimap2 -ax splice -uf -k14 ref.fa reads.fa > aln.sam # noisy Nanopore Direct RNA-seq
./minimap2 -ax splice -uf -C5 ref.fa query.fa > aln.sam # Final PacBio Iso-seq or traditional cDNA
./minimap2 -cx asm5 asm1.fa asm2.fa > aln.paf # intra-species asm-to-asm alignment
./minimap2 -x ava-pb reads.fa reads.fa > overlaps.paf # PacBio read overlap
./minimap2 -x ava-ont reads.fa reads.fa > overlaps.paf # Nanopore read overlap
minimap2 -ax sr \
-t <Thread> \
-R <ReadGroup> \
ucsc.hg19.fasta \
sample_1.fastq.gz \
sample_2.fastq.gz
Compared with BWA 0.7.12, the mapping time was reduced by about 20% for HiSeq short reads.
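
When the paired-end command above is launched from a script, it can help to assemble the argument list in code. The following is a minimal sketch, not part of minimap2 itself: the function name build_sr_command and the read group value are hypothetical, and only the -a, -x, -t and -R options listed in the help text above are used.

```python
import shlex


def build_sr_command(reference, reads1, reads2, threads=3, read_group=None,
                     minimap2="minimap2"):
    """Build the argument list for paired-end short-read mapping (-ax sr)."""
    cmd = [minimap2, "-ax", "sr", "-t", str(threads)]
    if read_group is not None:
        # minimap2 expects the literal two characters "\t" between fields
        cmd += ["-R", read_group]
    cmd += [reference, reads1, reads2]
    return cmd


cmd = build_sr_command(
    "ucsc.hg19.fasta", "sample_1.fastq.gz", "sample_2.fastq.gz",
    threads=4, read_group=r"@RG\tID:foo\tSM:bar",  # hypothetical read group
)
print(shlex.join(cmd))  # shell-quoted command line, for inspection only
```

To actually run the mapping, the list can be passed to subprocess.run with stdout redirected to a SAM file of your choice, since minimap2 writes alignments to standard output.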


Reference
https://github.com/lh3/minimap2

Saturday, November 17, 2018

[JAVA] How to Print Date in yyyy-mm-dd format in Java, Class Date

How to Print Date in yyyy-mm-dd format in Java, Class Date

QUESTION

How to Print Date in yyyy-mm-dd format in Java?
How to Print Time in hh:mm:ss format in Java?

ANSWER

One of the simplest ways to print a formatted date in Java is to use the Date class.
The Date class represents a specific point in time.
It is just a container for the number of milliseconds since the UNIX epoch (January 1, 1970 00:00:00.000 GMT).
If a Date object is printed directly, the result looks like this.
Sun Nov 18 12:34:56 GMT 2018
CurrentTime.java
import java.util.Date;

public class CurrentTime {
 public static void main(String[] args) {
  Date today = new Date();
  System.out.println(today);
  // Sun Nov 18 12:34:56 GMT 2018
 }
}
CurrentTime2.java
import java.text.SimpleDateFormat;
import java.util.Date;

public class CurrentTime2 {
 public static void main(String[] args) {
  Date today = new Date();
  System.out.println(today);
  // Sun Nov 18 12:34:56 GMT 2018

  SimpleDateFormat date = new SimpleDateFormat("yyyy-MM-dd");
  SimpleDateFormat time = new SimpleDateFormat("hh:mm:ss a");

  System.out.println("Date: "+date.format(today));
  // Date: 2018-11-18
  System.out.println("Time: "+time.format(today));
  // Time: 12:34:56 PM (same instant as above, assuming the default time zone is GMT)
 }
}


Thursday, November 15, 2018

[Python] How to sort a Python list of strings with numbers inside? - How to write natural sort order in Python?





Question:

How to sort a Python list of strings that contain numbers?
For example, [ "item1", "item13", "item15", "item7" ]

How to write natural sort order in Python?

Answer:

There are two basic ways to sort a list in Python.

First, the sorted() function. The built-in sorted() function returns a new sorted list. Be careful! The original list itself is not modified. [Line 6, Line 7]


Second, the list.sort() method. The list.sort() method sorts the list itself in place.


When the list items contain numbers inside strings, neither the sorted() function nor the list.sort() method gives the expected order by default.

Both Line 5 and Line 7 show a list sorted in lexicographic order rather than numeric order.

To fix this, let's call the list.sort() method with a lambda function. [Line 12]

The list.sort() method accepts a "key" argument, a function that specifies the sorting criterion.



In this example, each list element consists of a "string" + "number".

["item7", "item1", "item15", "item13"]  [Line 4]

We want to sort these elements in numeric order, like this:

['item1', 'item7', 'item13', 'item15']  [Line 13]

With the default sort, however, the result is not what we want.

['item1', 'item13', 'item15', 'item7']  [Line 5, 7]

This is because lexicographic sorting compares strings character by character, and the character '7' is greater than '1'.

That's why the default sort places 'item13' before 'item7'.
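
The character-by-character comparison can be checked directly. This is a small sketch using the example list from this post; the variable names are only illustrative.

```python
items = ["item7", "item1", "item15", "item13"]

# Strings are compared character by character, so "item13" < "item7":
# the first differing characters are '1' and '7'.
print("item13" < "item7")  # True
print(ord("1"), ord("7"))  # 49 55

default_sorted = sorted(items)
print(default_sorted)      # ['item1', 'item13', 'item15', 'item7']
print(items)               # sorted() left the original list unchanged
```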


To solve this problem, we need to tell Python the sorting criterion by passing a parameter called "key".

The sort key should be the numeric part, not the whole string.

Every element in this list consists of the string "item" followed by a numeric value.

Slicing each element from index 4 onward gives the numeric part.

Like element[4:]

In this way, list items of the form "string" + "number" can be sorted with lst.sort(key=lambda x: int(x[4:])). [Line 12]
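
The numbered listing that the text refers to is not shown here, so the following is a minimal sketch of the same idea rather than the original code. The first part uses the slicing key described above. The second part, natural_key, is an assumed generalization for strings whose prefix length varies; the "chr" names are hypothetical examples.

```python
import re

items = ["item7", "item1", "item15", "item13"]

# Approach from the text: every element is "item" + number, so slice off
# the first four characters and compare the rest as an integer.
by_slice = sorted(items, key=lambda x: int(x[4:]))
print(by_slice)  # ['item1', 'item7', 'item13', 'item15']


def natural_key(text):
    """Split text into alternating non-digit and digit runs for sorting."""
    parts = re.split(r"(\d+)", text)
    # Odd positions hold the captured digit runs; convert those to int.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


mixed = ["chr10", "chr2", "chr1", "chrX"]
mixed.sort(key=natural_key)
print(mixed)  # ['chr1', 'chr2', 'chr10', 'chrX']
```

The slicing key is the simplest choice when every element shares the same fixed prefix. The regular-expression key is worth the extra lines only when the prefix length differs between elements.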








Reference:
Python List.sort() method, Key Functions.

Why do some sorting methods sort by 1, 10, 2, 3 ... ?



[Bioinformatics] How to open a Fasta file and count its total bases in Windows?







Question:

How to open a Fasta file and count its total bases in Windows?


Answer:

There are two ways to open a Fasta file and count its total bases in Windows.


Solution1)

Read the Fasta file with a Python script.



https://github.com/KennethJHan/FastaReader/blob/master/Python/FastaReader.py
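
For readers who only want the idea, here is a minimal sketch of counting bases in a Fasta file. It is not the linked FastaReader.py script; the function name count_bases and the tiny demo file are assumptions made for illustration.

```python
import os
import tempfile


def count_bases(fasta_path):
    """Return (number of sequences, total bases) for a Fasta file."""
    n_sequences = 0
    total_bases = 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue              # skip blank lines
            if line.startswith(">"):
                n_sequences += 1      # a header line starts a new record
            else:
                total_bases += len(line)
    return n_sequences, total_bases


# Demo on a small temporary file (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1\nACGT\nAC\n>seq2\nGGG\n")
demo_result = count_bases(tmp.name)
os.remove(tmp.name)
print(demo_result)  # (2, 9)
```

Reading line by line keeps memory use small, which matters for genome-sized files.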





Solution2)

Download FastaReader GUI v1.0

1) Download FastaReaderGui.jar

https://github.com/KennethJHan/FastaReader/tree/master/JAVA_GUI/v1.0

2) Launch FastaReaderGui.jar.
Click "Open a Fasta File" button.


3) Select and open a Fasta file.


4) See the result.



Sunday, January 21, 2018

[Bioinformatics 101] 007. Function Method



Hello! This is Kenneth J Han!

In this post, we are going to look at functions (or methods).




007. Function Method

Problem

Create a function (or a method) called Factorial.
Factorial takes one integer parameter "num" and returns the factorial of "num".
Calculate the values of 3 factorial, 4 factorial and 5 factorial.

Pseudocode

Factorial(num)
    result ← 1
    WHILE num > 0
        result ← result * num
        num ← num - 1
    RETURN result

result3 ← Factorial(3)
result4 ← Factorial(4)
result5 ← Factorial(5)

PRINT result3, result4, result5

Answer

6  24  120
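
A direct Python translation of the pseudocode above might look like the sketch below. It was written for this page and is not necessarily identical to the linked answer on GitHub; the lowercase name factorial follows Python naming habits.

```python
def factorial(num):
    """Return num! by repeated multiplication, as in the pseudocode."""
    result = 1
    while num > 0:
        result = result * num
        num = num - 1
    return result


result3 = factorial(3)
result4 = factorial(4)
result5 = factorial(5)

print(result3, result4, result5)  # 6 24 120
```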


If you have difficulty solving the problem, visit the links below!
The source code is ready for you!

Python source code answer on Github
Java source code answer on Github

See you in the next post!

[Bioinformatics 101] 006. while loop



Hello! This is Kenneth J Han!

In this post, we are going to look at the while loop.




006. WHILE Loop

Problem

Using a WHILE loop, calculate 5! (5 factorial).

Pseudocode

num ← 5
result ← 1

WHILE num > 0
    result ← result * num
    num ← num - 1

PRINT result

Answer

120
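
In Python, the loop can be written almost word for word. This is only a sketch of the pseudocode, with the same variable names.

```python
num = 5
result = 1

# Multiply result by num, then count num down until it reaches 0.
while num > 0:
    result = result * num
    num = num - 1

print(result)  # 120
```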


If you have difficulty solving the problem, visit the links below!
The source code is ready for you!

Python source code answer on Github
Java source code answer on Github

See you in the next post!

[Bioinformatics 101] 005. for statement



Hello! This is Kenneth J Han!

In this post, we are going to look at the for statement.




005. FOR Statement

Problem

Sum all integers from 1 to 10.

Pseudocode

sum ← 0

FOR i ← 1 TO 10
    sum ← sum + i
PRINT sum

Answer

55
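
One possible Python version is sketched below. The variable is named total instead of sum, an assumption made here so that Python's built-in sum() function is not shadowed.

```python
total = 0

# range(1, 11) stops before 11, so i takes the values 1 through 10.
for i in range(1, 11):
    total = total + i

print(total)  # 55
```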


If you have difficulty solving the problem, visit the links below!
The source code is ready for you!

Python source code answer on Github
Java source code answer on Github

See you in the next post!

[Bioinformatics 101] 004. if else statement



Hello! This is Kenneth J Han!

In this post, we are going to look at the if ... else statement.




004. IF ELSE Statement

Problem
Check whether the variable num1 is a multiple of 3 or a multiple of 7.
Pseudocode
num1 ← 7

IF num1 % 3 == 0
    PRINT "Multiple of 3"
ELSE IF num1 % 7 == 0
    PRINT "Multiple of 7"
ELSE
    PRINT "None of them"

Answer
Multiple of 7
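
As a rough Python equivalent, the branches map onto if, elif and else. The message variable is an addition made here so that the result can be reused after the check.

```python
num1 = 7

if num1 % 3 == 0:
    message = "Multiple of 3"
elif num1 % 7 == 0:
    message = "Multiple of 7"
else:
    message = "None of them"

print(message)  # Multiple of 7
```

Only the first matching branch runs, so a value such as 21 would report "Multiple of 3" and never reach the second test.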

If you have difficulty solving the problem, visit the links below!
The source code is ready for you!

Python source code answer on Github
Java source code answer on Github

See you in the next post!

[Bioinformatics 101] 003. Operators



Hello! This is Kenneth J Han!

In this post, we are going to look at basic operators.





003. Operators

Problem
Put 7 in variable "num1", and put 2 in variable "num2".
Then apply each operator to the two operands: "add +", "subtract -", "multiply *", "divide /", "remainder %" and power.
Pseudocode
num1 ← 7
num2 ← 2
PRINT num1 + num2
PRINT num1 - num2
PRINT num1 * num2
PRINT num1 / num2
PRINT num1 % num2
PRINT POW(num1, num2)
Answer
9
5
14
3.5
1
49
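
The same six operations are sketched in Python below, assuming Python 3, where / is true division and ** is the power operator. Collecting the values in a list named results is a choice made for this example.

```python
num1 = 7
num2 = 2

results = [
    num1 + num2,   # 9
    num1 - num2,   # 5
    num1 * num2,   # 14
    num1 / num2,   # 3.5 (true division in Python 3)
    num1 % num2,   # 1
    num1 ** num2,  # 49, same as pow(num1, num2)
]

for value in results:
    print(value)
```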

If you have difficulty solving the problem, visit the links below!
The source code is ready for you!

Python source code answer on Github
Java source code answer on Github

See you in the next post!

[Bioinformatics 101] 002. Working with variables! - What is a variable?



Hello! This is Kenneth J Han!

In this post, we are going to calculate the area of a circle given the radius is 3.




002. Working with variables!

Problem
Calculate the area of a circle given the radius is 3.
The process of calculation should use the variables - r, PI and area.
Pseudocode
r ← 3
PI ← 3.14
area ← r * r * PI
PRINT area
Answer
28.26
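
Here is a small Python sketch of the calculation. It keeps the rounded constant 3.14 from the pseudocode; formatting the output to two decimal places is an assumption made to keep the printed value tidy.

```python
r = 3
PI = 3.14  # rounded constant from the pseudocode; math.pi is more precise
area = r * r * PI

print(f"{area:.2f}")  # 28.26
```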

If you have difficulty solving the problem, visit the links below!
The source code is ready for you!

Python source code answer on Github
Java source code answer on Github

See you in the next post!

Saturday, January 20, 2018

[Bioinformatics 101] 001. Hello World!



Hello! This is Kenneth J Han! I'm really glad to meet you guys!

Starting with this post, I'll post basic bioinformatics problems.


Problems are given with pseudocode.

This means you can write the code in whatever language you want to use.

There are lots of programming languages around the world.

If you haven't chosen your main language yet, I recommend Python, which is easy to learn but very powerful.





001. Hello World!

Problem
Print Hello, Bioinformatics
Pseudocode
PRINT "Hello, Bioinformatics"
Answer
Hello, Bioinformatics
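
For completeness, the Python form is a single print call. Storing the text in a variable named message first is optional and is done here only for illustration.

```python
message = "Hello, Bioinformatics"
print(message)  # Hello, Bioinformatics
```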

If you have difficulty solving the problem, visit the links below!
The source code is ready for you!

Python source code answer on Github
Java source code answer on Github

See you in the next post!

Tuesday, January 9, 2018

Introduction - Recent me - Working as a genomic data analyst


Hello World!

Hi, this is Kenneth J Han from South Korea.

It's been more than two years since I finished my degree and started working at Macrogen Korea.


There, I analyze genomic data and build pipelines for big data analysis.


Working as a genome analyst is perfect for me since I majored in biological science and I'm fond of working with computers.


Starting a blog has been one of my dreams. From now on, I'll post not only bioinformatics content but also articles on programming languages such as Python, Java and C, and on web development.


I hope you guys enjoy this site.


See you in the next post :) Bye