TASUKE - Preparation

Preparation

INPUT files

List

File	Description	Required (o) or optional (x)
List of Accessions	List of accessions as a Comma-Separated Value (CSV) file.	o
Reference genome	(Multi-)FASTA file.	o
Information of reference genome	Sequence information as a Comma-Separated Value (CSV) file.	o
Variant	VCF file for each sample.	x
BAM (depth)	Depth file in Tab-Separated Value (TSV) format, converted from the BAM file of each sample.	x
General purpose data	TSV file converted from a BED or BEDGraph file for each sample.	x
Annotation	GFF file.	x
Information of phenotype	Phenotype information (CSV) file for GWAS result visualization.	x
qqman output file	qqman output file.	x
Configuration file	Server environment and default values for visualization. (The TASUKE package contains this file.)	o

The "chromosome names" must be identical across all INPUT files.

We confirmed that genome data (reference sequence and GFF annotation) of model species in public databases can be registered correctly for the following sites (last accessed 5 Aug. 2019).

Preparation of INPUT files

List of Accessions as CSV file

This information can be displayed in the TASUKE browser. "Accession" is required and must be unique, because it is used as the "ID" when installing the other files. All other columns are optional and are shown in the browser as additional information about each accession.
"Other 1" and "Other 2" are new parameters added in this version and can be omitted.

ID	Name	variety	Sub variety	Origin	Origin 2	Type	[Other 1]	[Other 2]

IRGC12793,Kitrana 508,japonica,ARO,Madagascar,,Elite
IRGC30416,IR 36,indica,IND,Bangladesh,,Landrace
..

Each parameter has a maximum character length (in half-width):
ID=40, Name=40, variety=40, SubVariety=40, Origin=20, Origin2=30, Type=40, Other1=(NoLimit), Other2=(NoLimit)
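
The uniqueness and length rules above can be checked before installation. The following is a minimal sketch and is not part of the TASUKE package: the function name `check_accession_rows` is hypothetical, the column order is taken from the header shown above, and lengths are counted with `len()`, which assumes half-width characters only.

```python
import csv
import io

# Column names and maximum lengths, in the order of the header shown above.
# None means no limit (Other 1, Other 2).
COLUMN_NAMES = ["ID", "Name", "variety", "SubVariety", "Origin",
                "Origin2", "Type", "Other1", "Other2"]
MAX_LENGTHS = [40, 40, 40, 40, 20, 30, 40, None, None]


def check_accession_rows(text):
    """Return a list of problems found in accession-list CSV text."""
    problems = []
    seen_ids = set()
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) > len(COLUMN_NAMES):
            problems.append(f"line {line_no}: too many columns ({len(row)})")
            continue
        accession = row[0]
        if not accession:
            problems.append(f"line {line_no}: empty ID")
        elif accession in seen_ids:
            problems.append(f"line {line_no}: duplicate ID {accession!r}")
        seen_ids.add(accession)
        # zip stops at the shortest input, so omitted trailing columns are fine.
        for name, limit, value in zip(COLUMN_NAMES, MAX_LENGTHS, row):
            if limit is not None and len(value) > limit:
                problems.append(
                    f"line {line_no}: {name} exceeds {limit} characters")
    return problems


sample = (
    "IRGC12793,Kitrana 508,japonica,ARO,Madagascar,,Elite\n"
    "IRGC30416,IR 36,indica,IND,Bangladesh,,Landrace\n"
)
print(check_accession_rows(sample))
```

An empty list means no problem was detected by these checks; it does not guarantee that TASUKE will accept the file.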

Information of Reference Genome as CSV file

TASUKE builds its database from the variant and depth data of the chromosomes listed here, so this is the step where you select which chromosomes to show in the TASUKE browser. Chromosome lengths can be obtained from the reference.fa.fai file generated by SAMtools faidx in the variant calling step.

Chromosome name	Length of each Chromosome	Start of centromere	End of centromere

chr01,43270923,16610866,17243770
chr02,35937250,13872411,13541821
..

If you do not have the centromere start and end positions, set "0" for both values.
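
As an illustration of deriving this file from a .fai index, here is a minimal sketch. It relies on the standard .fai layout (tab-separated, sequence name in column 1 and length in column 2). The function name `fai_to_chromosome_csv` is hypothetical, and the offset columns in the example index are made up for illustration.

```python
def fai_to_chromosome_csv(fai_text, centromeres=None):
    """Convert the text of a .fai index into chromosome CSV lines.

    centromeres: optional dict mapping chromosome name -> (start, end).
    Chromosomes without an entry get 0 for both values.
    """
    centromeres = centromeres or {}
    lines = []
    for record in fai_text.splitlines():
        if not record.strip():
            continue  # skip blank lines
        fields = record.split("\t")
        name, length = fields[0], int(fields[1])  # first two .fai columns
        start, end = centromeres.get(name, (0, 0))
        lines.append(f"{name},{length},{start},{end}")
    return "\n".join(lines)


# Lengths are from the example above; the remaining .fai columns are invented.
fai_example = (
    "chr01\t43270923\t7\t60\t61\n"
    "chr02\t35937250\t43992119\t60\t61\n"
)
print(fai_to_chromosome_csv(fai_example, {"chr01": (16610866, 17243770)}))
```

To hide a chromosome from the browser, drop its line from the result before registering the file.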

Variant in VCF file

Either one VCF file per accession or a multi-sample VCF is needed. TASUKE accepts VCF files generated by SAMtools and GATK. To show the effect of variants (e.g. non-synonymous, frameshift), add "EFF" information to the "INFO" field using snpEff. TASUKE supports snpEff versions 3.x and 4.x.

1) SAMtools

VCF files must contain "DP4" values in the INFO column and "GT" in the FORMAT column, which can be added with the -g and -D options of samtools mpileup.

CHROM	POS	ID	REF	ALT	QUAL	FILTER	*INFO	FORMAT	SAMPLE
chr01	335603	.	T	C	145.0	.	*INFO	GT:PL:DP:SP:GQ	1/1:178,30,0:10:0:57
chr01	370847	.	GGTTGTTG	GGTTG	214.0	.	*INFO	GT:PL:DP:SP:GQ	1/1:255,66,0:22:0:99

*INFO

DP=11;VDB=0.0414;AF1=1;AC1=2;DP4=0,0,3,7;MQ=46;FQ=-57;EFF=UPSTREAM(MODIFIER||||Os01g0106500|protein_coding|CODING|Os01t0106500-01|)
INDEL;DP=26;VDB=0.0395;AF1=1;AC1=2;DP4=0,0,11,11;MQ=49;FQ=-101;EFF=DOWNSTREAM(MODIFIER||||Os01g0106700|protein_coding|CODING|Os01t0106700-00|)

In a VCF file generated by SAMtools:

If the variant is a SNP, INFO starts with "DP=".
If the variant is an insertion or deletion, INFO starts with "INDEL".
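
The requirements above (DP4 in INFO, GT in FORMAT, and the INFO prefix rule) can be checked per record. This is a minimal sketch under the assumption of a single-sample VCF data line with ten tab-separated columns; the function name `describe_samtools_record` is hypothetical, and the EFF annotation is left out of the example INFO value for brevity.

```python
def describe_samtools_record(vcf_line):
    """Summarize one data line of a SAMtools-generated VCF."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("expected at least 10 tab-separated columns")
    info, fmt = fields[7], fields[8]
    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_keys = {item.split("=", 1)[0] for item in info.split(";")}
    return {
        "type": "INDEL" if info.startswith("INDEL") else "SNP",
        "has_DP4": "DP4" in info_keys,
        "has_GT": "GT" in fmt.split(":"),
    }


snp_line = "\t".join([
    "chr01", "335603", ".", "T", "C", "145.0", ".",
    "DP=11;VDB=0.0414;AF1=1;AC1=2;DP4=0,0,3,7;MQ=46;FQ=-57",
    "GT:PL:DP:SP:GQ", "1/1:178,30,0:10:0:57",
])
indel_line = "\t".join([
    "chr01", "370847", ".", "GGTTGTTG", "GGTTG", "214.0", ".",
    "INDEL;DP=26;VDB=0.0395;AF1=1;AC1=2;DP4=0,0,11,11;MQ=49;FQ=-101",
    "GT:PL:DP:SP:GQ", "1/1:255,66,0:22:0:99",
])
print(describe_samtools_record(snp_line))
print(describe_samtools_record(indel_line))
```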

2) GATK

VCF files must contain "AD" and "GT" values in the FORMAT column.
Multi-sample VCF is supported, and one of two DB modes can be selected: "Convert each accession to a single-sample VCF" and "Use multi-sample VCF information as is". See here for details.

CHROM	POS	ID	REF	ALT	QUAL	FILTER	*INFO	FORMAT	SAMPLE
chr01	335603	.	T	C	688.77	.	*INFO	GT:AD:DP:GQ:PL	1/1:0,19:19:57:717,57,0
chr01	370847	.	GA	G	214.0	.	*INFO	GT:AD:DP:GQ:PL	1/1:0,19:19:57:717,57,0

*INFO

AC=2;AF=1.00;AN=2;DP=5;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=52.03;MQ0=0;QD=29.38;EFF=DOWNSTREAM(MODIFIER||2162|||LOC_Os01g01360|||LOC_Os01g01360.1||1),INTERGENIC(MODIFIER||||||||||1)
AC=2;AF=1.00;AN=2;DP=22;FS=0.000;HaplotypeScore=124.8587;MLEAC=2;MLEAF=1.00;MQ=56.20;MQ0=0;QD=36.03;RPA=9,7;RU=A;STR;EFF=DOWNSTREAM(MODIFIER||2223|||LOC_Os01g01369|||LOC_Os01g01369.1||1),INTERGENIC(MODIFIER||||||||||1),
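
To confirm that a GATK record carries the required "GT" and "AD" values, the FORMAT column can be paired with the sample column. This is a minimal sketch; the function name `read_gatk_sample` is hypothetical, and treating "." as a missing allele depth is an assumption about the input, not documented TASUKE behavior.

```python
def read_gatk_sample(format_field, sample_field):
    """Return (genotype, allele_depths) from FORMAT and sample columns."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    sample = dict(zip(keys, values))
    missing = [key for key in ("GT", "AD") if key not in sample]
    if missing:
        raise ValueError("missing FORMAT keys: " + ", ".join(missing))
    # AD holds one read depth per allele, reference first.
    depths = [None if d == "." else int(d) for d in sample["AD"].split(",")]
    return sample["GT"], depths


print(read_gatk_sample("GT:AD:DP:GQ:PL", "1/1:0,19:19:57:717,57,0"))
```

For a multi-sample VCF, the same function can be applied to each sample column in turn, using the shared FORMAT column.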

Convert BAM file to depth information file (TSV)

One BAM file per accession is needed to create depth information. Alignments of various sequence types (e.g. whole genome, RNA) are accepted. This procedure requires SAMtools, and we recommend running it on your analysis server.

$ tasuke_bamtodepth.pl -i <BAM file> -o <output name> -c <chromosome list> -s <samtools path>

Required:
-i <BAM file> : BAM file
-o <output name> : Output depth file name (TSV)
-c <chromosome list> : Chromosome information file (CSV)
-s <samtools path> : The path of SAMtools for running samtools depth

Optional:
-bq <base quality threshold> : int (default:0)
-mq <mapping quality threshold> : int (default:0)

* A working directory is created in the same directory as the depth file and is deleted after processing finishes.
* The chromosome list is the same file that is used as input to "tasuke_init.pl".
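
When there are many accessions, the conversion command can be generated per BAM file. The sketch below only builds and prints the command lines, using the options listed above; it does not run them. The helper name `bamtodepth_command`, the file names, and the SAMtools path are hypothetical.

```python
import shlex


def bamtodepth_command(bam, out, chrom_csv, samtools,
                       base_q=None, map_q=None):
    """Build the argument list for one tasuke_bamtodepth.pl run."""
    cmd = ["tasuke_bamtodepth.pl", "-i", bam, "-o", out,
           "-c", chrom_csv, "-s", samtools]
    if base_q is not None:
        cmd += ["-bq", str(base_q)]  # base quality threshold
    if map_q is not None:
        cmd += ["-mq", str(map_q)]   # mapping quality threshold
    return cmd


# Hypothetical accession IDs, file names, and SAMtools location.
for accession in ["IRGC12793", "IRGC30416"]:
    cmd = bamtodepth_command(f"{accession}.bam", f"{accession}.depth.tsv",
                             "chromosome.csv", "/usr/local/bin/samtools",
                             map_q=20)
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.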

Convert BED or BEDGraph file to TSV file

One TSV file per accession is needed to create any genome information (ChIP-seq, BS-seq, RNA-seq and so on).
We recommend running this procedure on your analysis server.

$ tasuke_bedtotsv.pl -i <any file> -o <output name> -c <chromosome list>

Required:
-i <any file> : Genome information file (BED or BEDGraph)
-o <output name> : Output TSV file name
-c <chromosome list> : Chromosome information file (CSV)

Optional:
-g : Treat the input file as BEDGraph format.

* A working directory is created in the same directory as the TSV file and is deleted after processing finishes.
* The chromosome list is the same file that is used as input to "tasuke_init.pl".

Phenotype information file for GWAS result visualization

The phenotype information file requires three items: "Breed name" (accession name), "Phenotype" (phenotype name), and "Phenotype Value". The file is in Comma-Separated Value (CSV) format.

Breed name	Phenotype	Phenotype Value

Name 1,phenotype 1,1.55
Name 1,phenotype 2,11.1
Name 2,phenotype 1,4.5678
Name 2,phenotype 2,9999
..
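
The three-column layout above can be read and sanity-checked as follows. This is a minimal sketch: the function name `read_phenotypes` is hypothetical, and requiring a numeric "Phenotype Value" is an assumption based on the example rows.

```python
import csv
import io


def read_phenotypes(text):
    """Return {phenotype: {breed_name: value}} from phenotype CSV text."""
    table = {}
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 columns, got {len(row)}")
        breed, phenotype, value = (cell.strip() for cell in row)
        try:
            number = float(value)
        except ValueError:
            raise ValueError(
                f"line {line_no}: value {value!r} is not a number") from None
        table.setdefault(phenotype, {})[breed] = number
    return table


sample = (
    "Name 1,phenotype 1,1.55\n"
    "Name 1,phenotype 2,11.1\n"
    "Name 2,phenotype 1,4.5678\n"
    "Name 2,phenotype 2,9999\n"
)
print(read_phenotypes(sample))
```

Grouping by phenotype makes it easy to see which accessions have a value for each trait.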

qqman file for GWAS result visualization

The qqman file is generated by GWAS analysis tools. It is a Tab-Separated Value (TSV) file containing chromosome name, position and p-value.

CHR	BP (pos)	P (p-value)

chr01	2731	0.0720126631940765
chr01	6873	0.245921508033888
chr01	24810	0.198227063325373
chr01	31071	0.498345988659771
..
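
Because chromosome names must match the other INPUT files, it can help to check a qqman file against the chromosome list before registering it. This is a minimal sketch: the function name `check_qqman_rows` is hypothetical, skipping a first line that starts with "CHR" is an assumption about a possible header row, and fields are split on any whitespace so that tab- and space-separated rows are both handled.

```python
def check_qqman_rows(text, known_chromosomes):
    """Return a list of problems found in qqman-style text."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields or fields[0] == "CHR":
            continue  # blank line, or a header row (assumed to start with CHR)
        if len(fields) != 3:
            problems.append(
                f"line {line_no}: expected 3 columns, got {len(fields)}")
            continue
        chrom, pos, p_value = fields
        if chrom not in known_chromosomes:
            problems.append(f"line {line_no}: unknown chromosome {chrom!r}")
        try:
            ok = int(pos) > 0 and 0.0 <= float(p_value) <= 1.0
        except ValueError:
            ok = False
        if not ok:
            problems.append(f"line {line_no}: invalid position or p-value")
    return problems


sample = (
    "chr01\t2731\t0.0720126631940765\n"
    "chr01\t6873\t0.245921508033888\n"
)
print(check_qqman_rows(sample, {"chr01", "chr02"}))
```

The set of known chromosomes would normally be the first column of the reference genome information CSV.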

TASUKE+ multiple genome browser for variants and GWAS
