TASUKE - Preparation

Preparation

INPUT files

List

File	Description	Required (o = required, x = optional)
List of Accessions	List of Accessions in Comma-Separated Value (CSV) file	o
Reference genome	(Multi) FASTA file.	o
Information of reference genome	Sequence information in Comma-Separated Value (CSV) file	o
Variant	VCF file for each sample.	x
BAM(depth)	Depth file in TSV (Tab-Separated Value) format, converted from a BAM file, for each sample.	x
General purpose data	TSV (Tab-Separated Value) file converted from BED or BEDGraph, for each sample.	x
Annotation	GFF file.	x
Configure file	Server environment and default value for visualization. (TASUKE package contains this file.)	o

The "chromosome names" must be identical across all INPUT files.
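One way to check the names before installation is to collect the first column of each file and compare it with the chromosome list. The following is a minimal sketch, not part of TASUKE: the function names are hypothetical, and it assumes the chromosome name is the first field of every non-comment line (tab-separated in VCF, BED and GFF; comma-separated in the chromosome CSV). FASTA headers are not handled here.

```python
def chromosome_names(lines, sep="\t"):
    """Return the set of first-column values, skipping blank and '#' lines."""
    names = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.add(line.split(sep)[0])
    return names


def unexpected_names(reference_names, other_names):
    """Names used in another file that the chromosome CSV does not list."""
    return sorted(set(other_names) - set(reference_names))


# Illustrative data: the second VCF record uses "Chr02" instead of "chr02".
csv_lines = ["chr01,43270923,16610866,17243770",
             "chr02,35937250,13872411,13541821"]
vcf_lines = ["##fileformat=VCFv4.1",
             "chr01\t335603\t.\tT\tC",
             "Chr02\t100\t.\tA\tG"]

reference = chromosome_names(csv_lines, sep=",")
print(unexpected_names(reference, chromosome_names(vcf_lines)))  # ['Chr02']
```

An empty result means every name in the checked file also appears in the chromosome list.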

Preparation of INPUT files

List of Accessions as CSV file

These fields can be displayed in the TASUKE browser. The accession ID is essential and must be unique because it is used as the "ID" when installing the other files. The remaining fields are optional and are shown in the browser as additional information about each accession.

ID	Name	variety	Sub variety	Origin	Origin 2	Type

IRGC12793,Kitrana 508,japonica,ARO,Madagascar,,Elite
IRGC30416,IR 36,indica,IND,Bangladesh,,Landrace
..
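Because the ID must be unique, it can be worth scanning the list for repeated IDs before installation. The sketch below is illustrative only: `duplicate_accessions` is a hypothetical helper, and it assumes the file contains data rows like the ones above, with the ID in the first column.

```python
import csv
import io
from collections import Counter


def duplicate_accessions(csv_text):
    """Return the accession IDs (first column) that occur more than once."""
    rows = csv.reader(io.StringIO(csv_text))
    ids = [row[0].strip() for row in rows if row and row[0].strip()]
    return sorted(name for name, count in Counter(ids).items() if count > 1)


# Illustrative data: the third row reuses the ID of the first row.
accession_csv = (
    "IRGC12793,Kitrana 508,japonica,ARO,Madagascar,,Elite\n"
    "IRGC30416,IR 36,indica,IND,Bangladesh,,Landrace\n"
    "IRGC12793,Duplicate entry,,,,,\n"
)
print(duplicate_accessions(accession_csv))  # ['IRGC12793']
```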

Information of Reference Genome as CSV file

TASUKE builds a database from the variant and depth data of the chromosomes listed here, so this file determines which chromosomes are shown in the TASUKE browser. Chromosome lengths can be obtained from the reference.fa.fai file generated by SAMtools faidx in the variant calling step.

Chromosome name	Length of each Chromosome	Start of centromere	End of centromere

chr01,43270923,16610866,17243770
chr02,35937250,13872411,13541821
..

If you do not have centromere start and end positions, set both values to "0".

TASUKE does not accept chromosome names that contain a "." (dot). Replace each "." with "_" (underscore) or "-" (hyphen).
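The chromosome CSV can be derived from the .fai index, whose first two tab-separated columns are the sequence name and its length. The sketch below is one possible approach under stated assumptions: `fai_to_chromosome_csv` and its `keep` parameter are hypothetical, centromere positions are unknown (so both are set to 0), and dots are replaced with underscores. If a name is changed here, the same change has to be made in every other INPUT file.

```python
def fai_to_chromosome_csv(fai_lines, keep=None):
    """Turn .fai lines into 'name,length,0,0' rows for the chromosome CSV.

    keep: optional collection of original names to include; None keeps all.
    """
    rows = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], int(fields[1])
        if keep is not None and name not in keep:
            continue
        safe_name = name.replace(".", "_")  # TASUKE rejects "." in names
        rows.append(f"{safe_name},{length},0,0")
    return rows


# Illustrative .fai content (name, length, offset, line bases, line width).
fai = ["chr01\t43270923\t7\t60\t61\n",
       "scaffold.1\t1000\t44000000\t60\t61\n"]
for row in fai_to_chromosome_csv(fai):
    print(row)
```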

Variant in VCF file

One VCF file is needed per accession. TASUKE accepts VCF files generated by SAMtools and GATK. To show the effect of variants (e.g. non-synonymous, frameshift), add "EFF" information to the "INFO" field using snpEff. TASUKE supports snpEff versions 3.x and 4.x.

1) SAMtools

VCF files must contain "DP4" values in the INFO column and "GT" values in the FORMAT column, which can be added with the -g and -D options of samtools mpileup.

CHROM	POS	ID	REF	ALT	QUAL	FILTER	*INFO	FORMAT	SAMPLE
chr01	335603	.	T	C	145.0	.	*INFO	GT:PL:DP:SP:GQ	1/1:178,30,0:10:0:57
chr01	370847	.	GGTTGTTG	GGTTG	214.0	.	*INFO	GT:PL:DP:SP:GQ	1/1:255,66,0:22:0:99

*INFO

DP=11;VDB=0.0414;AF1=1;AC1=2;DP4=0,0,3,7;MQ=46;FQ=-57;EFF=UPSTREAM(MODIFIER||||Os01g0106500|protein_coding|CODING|Os01t0106500-01|)
INDEL;DP=26;VDB=0.0395;AF1=1;AC1=2;DP4=0,0,11,11;MQ=49;FQ=-101;EFF=DOWNSTREAM(MODIFIER||||Os01g0106700|protein_coding|CODING|Os01t0106700-00|)

In a VCF file generated by SAMtools:

If the variant is a SNP, INFO starts with "DP=".
If the variant is an insertion or deletion, INFO starts with "INDEL".
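The requirements above can be checked per record before loading a file. This is a minimal sketch rather than TASUKE's own validation: `check_samtools_record` is a hypothetical name, and it assumes a standard tab-separated VCF data line with INFO in column 8 and FORMAT in column 9.

```python
def check_samtools_record(line):
    """Report whether a SAMtools VCF data line has DP4 and GT, and its type."""
    fields = line.rstrip("\n").split("\t")
    # INFO is ';'-separated; entries are KEY=VALUE or a bare flag such as INDEL.
    info_keys = {item.split("=")[0] for item in fields[7].split(";")}
    format_keys = fields[8].split(":")
    return {
        "has_DP4": "DP4" in info_keys,
        "has_GT": "GT" in format_keys,
        "type": "INDEL" if "INDEL" in info_keys else "SNP",
    }


# The first example record shown above, with its INFO column filled in.
record = "\t".join([
    "chr01", "335603", ".", "T", "C", "145.0", ".",
    "DP=11;VDB=0.0414;AF1=1;AC1=2;DP4=0,0,3,7;MQ=46;FQ=-57",
    "GT:PL:DP:SP:GQ", "1/1:178,30,0:10:0:57",
])
print(check_samtools_record(record))
```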

2) GATK

VCF files must contain "AD" and "GT" values in the FORMAT column.

CHROM	POS	ID	REF	ALT	QUAL	FILTER	*INFO	FORMAT	SAMPLE
chr01	335603	.	T	C	688.77	.	*INFO	GT:AD:DP:GQ:PL	1/1:0,19:19:57:717,57,0
chr01	370847	.	GA	G	214.0	.	*INFO	GT:AD:DP:GQ:PL	1/1:0,19:19:57:717,57,0

*INFO

AC=2;AF=1.00;AN=2;DP=5;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=52.03;MQ0=0;QD=29.38;EFF=DOWNSTREAM(MODIFIER||2162|||LOC_Os01g01360|||LOC_Os01g01360.1||1),INTERGENIC(MODIFIER||||||||||1)
AC=2;AF=1.00;AN=2;DP=22;FS=0.000;HaplotypeScore=124.8587;MLEAC=2;MLEAF=1.00;MQ=56.20;MQ0=0;QD=36.03;RPA=9,7;RU=A;STR;EFF=DOWNSTREAM(MODIFIER||2223|||LOC_Os01g01369|||LOC_Os01g01369.1||1),INTERGENIC(MODIFIER||||||||||1),
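To confirm that a GATK record carries "GT" and "AD", the FORMAT keys can be paired with the sample values. The sketch below is illustrative; the function names are hypothetical, and it assumes a single sample column whose ':'-separated values follow the order of the FORMAT keys.

```python
def missing_gatk_keys(format_field):
    """Return the required FORMAT keys (GT, AD) that are absent."""
    keys = format_field.split(":")
    return [key for key in ("GT", "AD") if key not in keys]


def sample_values(format_field, sample_field):
    """Pair each FORMAT key with the matching value of the sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


# FORMAT and SAMPLE columns of the first GATK example record above.
values = sample_values("GT:AD:DP:GQ:PL", "1/1:0,19:19:57:717,57,0")
ref_depth, alt_depth = (int(n) for n in values["AD"].split(","))

print(missing_gatk_keys("GT:AD:DP:GQ:PL"))  # []
print(values["GT"], ref_depth, alt_depth)   # 1/1 0 19
```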

Convert BAM file to depth information file (TSV)

One BAM file is needed per accession to create depth information. Alignments of various sequence types (e.g. whole genome, RNA) are accepted. This procedure requires SAMtools, and we recommend running it on your analysis server.

$ tasuke_bamtodepth.pl -i <BAM file> -o <output name> -c <chromosome list> -s <samtools path>

Required:
-i <BAM file> : BAM file
-o <output name> : Output depth file name (TSV)
-c <chromosome list> : Chromosome information file (CSV)
-s <samtools path> : Path to SAMtools, used to run samtools depth

Optional:
-bq <base quality threshold> : int (default:0)
-mq <mapping quality threshold> : int (default:0)

* A working directory is created in the same directory as the depth file. It is deleted when processing finishes.
* The chromosome list is also used as an input file for "tasuke_init.pl".
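With many accessions, the command line can be generated per BAM file. The following sketch only builds the argument list from the options documented above; `bamtodepth_command`, the ".depth.tsv" output naming and the example paths are assumptions made for illustration.

```python
from pathlib import Path


def bamtodepth_command(bam, chrom_csv, samtools, out_dir, bq=None, mq=None):
    """Build the tasuke_bamtodepth.pl argument list for one BAM file."""
    bam = Path(bam)
    # Hypothetical naming: <BAM name without extension>.depth.tsv
    out = Path(out_dir) / (bam.stem + ".depth.tsv")
    cmd = ["tasuke_bamtodepth.pl",
           "-i", str(bam),
           "-o", str(out),
           "-c", str(chrom_csv),
           "-s", str(samtools)]
    if bq is not None:
        cmd += ["-bq", str(bq)]  # base quality threshold
    if mq is not None:
        cmd += ["-mq", str(mq)]  # mapping quality threshold
    return cmd


# Example paths are placeholders; pass the list to subprocess.run() to execute.
for bam in ["IRGC12793.bam", "IRGC30416.bam"]:
    print(" ".join(bamtodepth_command(bam, "chromosome.csv",
                                      "/usr/local/bin/samtools", "depth", mq=20)))
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.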

Convert BED or BEDGraph file to TSV file

One TSV file is needed per accession to display any other genome information (ChIP-seq, BS-seq, RNA-seq and so on).
We recommend running this procedure on your analysis server.

$ tasuke_bedtotsv.pl -i <any file> -o <output name> -c <chromosome list>

Required:
-i <any file> : Genome information file (BED or BEDGraph)
-o <output name> : Output TSV file name
-c <chromosome list> : Chromosome information file (CSV)

Optional:
-g : Treat the input file as BEDGraph format.

* A working directory is created in the same directory as the TSV file. It is deleted when processing finishes.
* The chromosome list is also used as an input file for "tasuke_init.pl".
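When BED and BEDGraph files are mixed, the -g option has to be set per file. The sketch below chooses it from the file extension, which is an assumed naming convention and not something TASUKE defines; `bedtotsv_command` and the extension set are hypothetical.

```python
from pathlib import Path

# Assumed convention: these extensions mark BEDGraph input.
BEDGRAPH_SUFFIXES = {".bedgraph", ".bdg", ".bg"}


def bedtotsv_command(infile, chrom_csv, out_dir):
    """Build the tasuke_bedtotsv.pl argument list for one BED/BEDGraph file."""
    infile = Path(infile)
    out = Path(out_dir) / (infile.stem + ".tsv")
    cmd = ["tasuke_bedtotsv.pl",
           "-i", str(infile),
           "-o", str(out),
           "-c", str(chrom_csv)]
    if infile.suffix.lower() in BEDGRAPH_SUFFIXES:
        cmd.append("-g")  # input is BEDGraph
    return cmd


# Example file names are placeholders.
print(" ".join(bedtotsv_command("IRGC12793.bedgraph", "chromosome.csv", "tsv")))
print(" ".join(bedtotsv_command("IRGC12793.bed", "chromosome.csv", "tsv")))
```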
