Preparation



INPUT files

List
File | Description | Required (o: required, x: optional)
List of Accessions | List of accessions in a Comma-Separated Value (CSV) file | o
Reference genome | (Multi) FASTA file | o
Information of reference genome | Sequence information in a Comma-Separated Value (CSV) file | o
Variant | VCF file for each sample | x
BAM (depth) | Depth file in Tab-Separated Value (TSV) format, converted from a BAM file, for each sample | x
General purpose data | TSV (Tab-Separated Value) file converted from BED or BEDGraph, for each sample | x
Annotation | GFF file | x
Information of phenotype | Phenotype information (CSV) file for GWAS result visualization | x
qqman output file | qqman output file | x
Configure file | Server environment and default values for visualization (the TASUKE package contains this file) | o


Preparation of INPUT files

List of Accessions as CSV file

This information can be displayed in the TASUKE browser. "Accession" is required and must be unique, because it is used as the "ID" when installing the other files. The remaining columns are optional and provide additional information about each accession in the browser.
"Other 1" and "Other 2" are new parameters added in this version and can be omitted.

Columns: ID, Name, Variety, Sub variety, Origin, Origin 2, Type, [Other 1], [Other 2]
IRGC12793,Kitrana 508,japonica,ARO,Madagascar,,Elite
IRGC30416,IR 36,indica,IND,Bangladesh,,Landrace
..
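
Because every other input file is keyed on the accession ID, it can be worth checking uniqueness before installing anything. The following is a minimal sketch and not part of TASUKE: the function name find_duplicate_ids is hypothetical, and it assumes the ID is the first field of a CSV without a header line.

```python
import csv
import io
from collections import Counter


def find_duplicate_ids(csv_text):
    """Return accession IDs (first column) that occur more than once."""
    rows = csv.reader(io.StringIO(csv_text))
    # Skip blank lines; the ID is assumed to be the first field.
    ids = [row[0].strip() for row in rows if row and row[0].strip()]
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


sample = (
    "IRGC12793,Kitrana 508,japonica,ARO,Madagascar,,Elite\n"
    "IRGC30416,IR 36,indica,IND,Bangladesh,,Landrace\n"
)
print(find_duplicate_ids(sample))  # prints []
```

An empty list means every accession ID in the file is unique.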

Information of Reference Genome as CSV file

TASUKE builds its database from the variant and depth data of the chromosomes listed here, so this is the step where you select which chromosomes to show in the TASUKE browser. Length information can be obtained from the reference.fa.fai file generated by SAMtools faidx in the variant calling step.

Columns: Chromosome name, Length of each chromosome, Start of centromere, End of centromere
chr01,43270923,16610866,17243770
chr02,35937250,13872411,13541821
..
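
As a rough illustration of taking lengths from the .fai index, the sketch below reads the first two tab-separated columns of a .fai file (sequence name and length) and emits lines in the four-column layout shown above. It is not part of TASUKE. The function name chromosome_csv_lines, the centromeres dictionary and the chrUn sequence are assumptions made for the example; centromere coordinates are not stored in the .fai file and have to be supplied separately.

```python
def chromosome_csv_lines(fai_text, centromeres):
    """Build 'name,length,cen_start,cen_end' lines for selected chromosomes.

    fai_text:    contents of a SAMtools .fai index (tab-separated; the first
                 two columns are sequence name and sequence length).
    centromeres: dict mapping chromosome name -> (start, end). Only the
                 chromosomes present in this dict are emitted.
    """
    lines = []
    for record in fai_text.splitlines():
        if not record.strip():
            continue
        fields = record.split("\t")
        name, length = fields[0], int(fields[1])
        if name in centromeres:
            start, end = centromeres[name]
            lines.append(f"{name},{length},{start},{end}")
    return lines


# The offset and line-width columns below are made-up values for illustration.
fai = "chr01\t43270923\t7\t60\t61\nchrUn\t1000\t43992120\t60\t61\n"
print(chromosome_csv_lines(fai, {"chr01": (16610866, 17243770)}))
# prints ['chr01,43270923,16610866,17243770']
```

Sequences left out of the dictionary (chrUn here) are dropped, which mirrors selecting the chromosomes to show at this step.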

Variant in VCF file

One VCF file per accession is required. TASUKE accepts VCF files generated by SAMtools and GATK. To show the effect of variants (e.g. non-synonymous, frameshift), add "EFF" information to the "INFO" field using snpEff. TASUKE supports snpEff versions 3.x and 4.x.

1) SAMtools

VCF files must contain "DP4" in the INFO column and "GT" in the FORMAT column, which can be added with the -g and -D options of samtools mpileup.

CHROM POS ID REF ALT QUAL FILTER *INFO FORMAT SAMPLE
chr01 335603 . T C 145.0 . *INFO GT:PL:DP:SP:GQ 1/1:178,30,0:10:0:57
chr01 370847 . GGTTGTTG GGTTG 214.0 . *INFO GT:PL:DP:SP:GQ 1/1:255,66,0:22:0:99
*INFO (values for the two records above, in order)
DP=11;VDB=0.0414;AF1=1;AC1=2;DP4=0,0,3,7;MQ=46;FQ=-57;EFF=UPSTREAM(MODIFIER||||Os01g0106500|protein_coding|CODING|Os01t0106500-01|)
INDEL;DP=26;VDB=0.0395;AF1=1;AC1=2;DP4=0,0,11,11;MQ=49;FQ=-101;EFF=DOWNSTREAM(MODIFIER||||Os01g0106700|protein_coding|CODING|Os01t0106700-00|)

In a VCF file generated by SAMtools:

If the variant is a SNP, INFO starts with "DP=".
If the variant is an insertion or deletion, INFO starts with "INDEL".
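
The requirements above can be checked per record before loading. This is a minimal sketch under stated assumptions, not TASUKE's own validation: classify_samtools_record is a hypothetical name, the record is assumed to be a tab-separated VCF data line with INFO in column 8 and FORMAT in column 9, and the EFF annotation is omitted from the sample INFO strings for brevity.

```python
def classify_samtools_record(line):
    """Return 'SNP' or 'INDEL' for a SAMtools VCF data line.

    Raises ValueError if INFO lacks DP4 or FORMAT lacks GT.
    """
    fields = line.rstrip("\n").split("\t")
    info, fmt = fields[7], fields[8]
    # INFO is a ';'-separated list of KEY=VALUE items or bare flags.
    info_keys = {item.split("=", 1)[0] for item in info.split(";")}
    if "DP4" not in info_keys:
        raise ValueError("INFO lacks DP4")
    if "GT" not in fmt.split(":"):
        raise ValueError("FORMAT lacks GT")
    return "INDEL" if info.startswith("INDEL") else "SNP"


snp = "\t".join([
    "chr01", "335603", ".", "T", "C", "145.0", ".",
    "DP=11;VDB=0.0414;AF1=1;AC1=2;DP4=0,0,3,7;MQ=46;FQ=-57",
    "GT:PL:DP:SP:GQ", "1/1:178,30,0:10:0:57",
])
indel = "\t".join([
    "chr01", "370847", ".", "GGTTGTTG", "GGTTG", "214.0", ".",
    "INDEL;DP=26;VDB=0.0395;AF1=1;AC1=2;DP4=0,0,11,11;MQ=49;FQ=-101",
    "GT:PL:DP:SP:GQ", "1/1:255,66,0:22:0:99",
])
print(classify_samtools_record(snp), classify_samtools_record(indel))
# prints SNP INDEL
```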


2) GATK

VCF files must contain "AD" and "GT" values in the FORMAT column.

CHROM POS ID REF ALT QUAL FILTER *INFO FORMAT SAMPLE
chr01 335603 . T C 688.77 . *INFO GT:AD:DP:GQ:PL 1/1:0,19:19:57:717,57,0
chr01 370847 . GA G 214.0 . *INFO GT:AD:DP:GQ:PL 1/1:0,19:19:57:717,57,0
*INFO (values for the two records above, in order)
AC=2;AF=1.00;AN=2;DP=5;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=52.03;MQ0=0;QD=29.38;EFF=DOWNSTREAM(MODIFIER||2162|||LOC_Os01g01360|||LOC_Os01g01360.1||1),INTERGENIC(MODIFIER||||||||||1)
AC=2;AF=1.00;AN=2;DP=22;FS=0.000;HaplotypeScore=124.8587;MLEAC=2;MLEAF=1.00;MQ=56.20;MQ0=0;QD=36.03;RPA=9,7;RU=A;STR;EFF=DOWNSTREAM(MODIFIER||2223|||LOC_Os01g01369|||LOC_Os01g01369.1||1),INTERGENIC(MODIFIER||||||||||1),
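
To illustrate how "GT" and "AD" sit in the FORMAT and SAMPLE columns, the sketch below pairs the FORMAT keys with the sample values and returns the genotype with its allele depths. It is an illustration only, not TASUKE code: genotype_and_allele_depths is a hypothetical name, the INFO string is shortened from the example above, and missing values (".") are not handled.

```python
def genotype_and_allele_depths(line):
    """Return (GT, [depth per allele]) from a GATK VCF data line.

    Raises ValueError if FORMAT lacks GT or AD.
    """
    fields = line.rstrip("\n").split("\t")
    keys = fields[8].split(":")      # FORMAT column, e.g. GT:AD:DP:GQ:PL
    values = fields[9].split(":")    # SAMPLE column, same order as FORMAT
    sample = dict(zip(keys, values))
    for required in ("GT", "AD"):
        if required not in sample:
            raise ValueError(f"FORMAT lacks {required}")
    depths = [int(d) for d in sample["AD"].split(",")]
    return sample["GT"], depths


record = "\t".join([
    "chr01", "335603", ".", "T", "C", "688.77", ".",
    "AC=2;AF=1.00;AN=2;DP=5",
    "GT:AD:DP:GQ:PL", "1/1:0,19:19:57:717,57,0",
])
print(genotype_and_allele_depths(record))  # prints ('1/1', [0, 19])
```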

Convert BAM file to depth information file (TSV)

One BAM file per accession is required to create depth information. Alignments of various sequence types (e.g. whole genome, RNA) are accepted. This procedure requires SAMtools, and we recommend running it on your analysis server.

$ tasuke_bamtodepth.pl -i <BAM file> -o <output name> -c <chromosome list> -s <samtools path>

Required:
-i <BAM file> : BAM file
-o <output name> : Output depth file name (TSV)
-c <chromosome list> : Chromosome information file (CSV)
-s <samtools path> : Path to SAMtools, used to run samtools depth

Optional:
-bq <base quality threshold> : int (default:0)
-mq <mapping quality threshold> : int (default:0)

* A working directory is created in the same directory as the depth file and is deleted when processing finishes.
* The chromosome list is the file used as an input to "tasuke_init.pl".
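
Since one depth file is needed per accession, the command is usually repeated for many BAM files. The sketch below only assembles the command lines from the options documented above and prints them; it does not run anything. The helper name build_depth_command, the file naming scheme (<accession>.bam, <accession>.depth.tsv, chromosome.csv), the SAMtools path and the mapping quality value 20 are all assumptions for illustration.

```python
import shlex


def build_depth_command(bam, out, chrom_csv, samtools, bq=None, mq=None):
    """Assemble a tasuke_bamtodepth.pl argument list from documented options."""
    cmd = ["tasuke_bamtodepth.pl",
           "-i", bam, "-o", out, "-c", chrom_csv, "-s", samtools]
    # Optional thresholds are added only when given; both default to 0.
    if bq is not None:
        cmd += ["-bq", str(bq)]
    if mq is not None:
        cmd += ["-mq", str(mq)]
    return cmd


# Hypothetical accessions and paths; adjust to your own layout.
for acc in ["IRGC12793", "IRGC30416"]:
    cmd = build_depth_command(f"{acc}.bam", f"{acc}.depth.tsv",
                              "chromosome.csv", "/usr/local/bin/samtools",
                              mq=20)
    print(shlex.join(cmd))
```

Keeping the arguments as a list makes it straightforward to pass them to subprocess.run later without shell quoting problems.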


Convert BED or BEDGraph file to TSV file

One TSV file per accession is required to display any genome information (ChIP-seq, BS-seq, RNA-seq and so on).
We recommend running this procedure on your analysis server.

$ tasuke_bedtotsv.pl -i <any file> -o <output name> -c <chromosome list>

Required:
-i <any file> : Genome information file (BED or BEDGraph)
-o <output name> : Output TSV file name
-c <chromosome list> : Chromosome information file (CSV)

Optional:
-g : Treat the input file as BEDGraph format.

* A working directory is created in the same directory as the TSV file and is deleted when processing finishes.
* The chromosome list is the file used as an input to "tasuke_init.pl".

Phenotype information file for GWAS result visualization

The phenotype information file requires three fields: "Breed name" (accession name), "Phenotype" (phenotype name), and "Phenotype Value". The format of this file is Comma-Separated Value (CSV).

Columns: Breed name, Phenotype, Phenotype Value
Name 1,phenotype 1,1.55
Name 1,phenotype 2,11.1
Name 2,phenotype 1,4.5678
Name 2,phenotype 2,9999
..
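
The file is in long format, with one line per combination of breed and phenotype. As a hedged illustration of that layout (not TASUKE code), the sketch below groups the values by phenotype. The function name load_phenotypes is hypothetical, and the file is assumed to have no header line and numeric phenotype values.

```python
import csv
import io
from collections import defaultdict


def load_phenotypes(csv_text):
    """Return {phenotype: {breed name: value}} from long-format CSV text."""
    table = defaultdict(dict)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        breed, phenotype, value = (field.strip() for field in row)
        table[phenotype][breed] = float(value)
    return dict(table)


sample = (
    "Name 1,phenotype 1,1.55\n"
    "Name 1,phenotype 2,11.1\n"
    "Name 2,phenotype 1,4.5678\n"
    "Name 2,phenotype 2,9999\n"
)
print(load_phenotypes(sample)["phenotype 1"])
# prints {'Name 1': 1.55, 'Name 2': 4.5678}
```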

qqman file for GWAS result visualization

The qqman file is generated by GWAS analysis tools. The format of this file is Tab-Separated Value (TSV), with chromosome name, position and p-value.

Columns: CHR, BP (pos), P (p-value)
chr01 2731 0.0720126631940765
chr01 6873 0.245921508033888
chr01 24810 0.198227063325373
chr01 31071 0.498345988659771
..
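
GWAS results are commonly plotted as -log10(p), so a quick sanity check of a qqman file is to find its most significant position. The sketch below is an illustration under assumptions, not part of TASUKE: most_significant is a hypothetical name, the three columns are assumed to appear in the order shown above, and any line whose position or p-value is not numeric (for example a header line) is skipped.

```python
import math


def most_significant(tsv_text):
    """Return (chromosome, position, -log10(p)) for the smallest p-value."""
    best = None
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        try:
            pos, p = int(fields[1]), float(fields[2])
        except ValueError:
            continue  # header or malformed line
        if p <= 0:
            continue  # log10 is undefined for non-positive values
        score = -math.log10(p)
        if best is None or score > best[2]:
            best = (fields[0], pos, score)
    return best


sample = (
    "chr01\t2731\t0.0720126631940765\n"
    "chr01\t6873\t0.245921508033888\n"
    "chr01\t24810\t0.198227063325373\n"
    "chr01\t31071\t0.498345988659771\n"
)
chrom, pos, score = most_significant(sample)
print(chrom, pos, round(score, 2))
```

For the four example rows, the smallest p-value is the one at chr01 position 2731.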