Installation
Outline
This page explains the TASUKE commands for a new installation and for building the database. To update a TASUKE+ installation you are already using, see "How to update".
Database creation time depends on genome size, the number of samples, and the performance of the server.
Steps 5 and 7 are repeated for each accession.
A shell script is provided as a unified installation tool. It automatically finds the files in a specified directory and runs these steps. (It does not support GWAS or the System phylogenetic tree.)
More detail: Unified installer
The TASUKE browser requires a LAMP server: a Linux server with Apache, MySQL 5.0 or later, and PHP 5.3 or later with the mysqli module.
First, configure PHP so that PHP code in .html files is executed.
For CentOS8(Almalinux8) or later, configure as follows. If "php-fpm" is not installed, install and enable it. If the "yum list php-fpm" command returns "Installed", it is installed.
$ systemctl start php-fpm
$ systemctl enable php-fpm
<IfModule !mod_php.c>
......
<FilesMatch \.(php|phar|html)$>
......
</IfModule>
<IfModule mod_php7.c>
......
<FilesMatch \.(php|phar|html)$>
......
</IfModule>
security.limit_extensions = .php .php3 .php4 .php5 .php7 .html
$ systemctl restart php-fpm
For CentOS7(RHEL7) and earlier, the settings were as follows.
Modify '/etc/httpd/conf.d/php.conf' as follows and restart httpd:
•using php5.x
AddHandler php5-script .php
-> AddHandler php5-script .php .html
•using php7.x
AddHandler php7-script .php
-> AddHandler php7-script .php .html
For Ubuntu, add the following to the end of '/etc/apache2/apache2.conf' and restart httpd:
SetHandler application/x-httpd-php
</FilesMatch>
Next, install additional modules as needed. These are often installed by default, but if they are not, please install them.
If php-mysql is not installed, install it. If the result of the command "php -m" contains "mysqlnd" or "mysqli", it is installed.
$ yum install php-mysql
•CentOS php5.5 or later
$ yum install php-mysqlnd
•Ubuntu
$ apt install php-mysql
If php-json is not installed (this can be the case when you are using php5.1 or earlier, or php7.x), install it. If the result of the command "php -m" contains "json", it is installed.
e.g.) If you want to install curl module on Ubuntu PHP7.4.
$ yum install perl-DBD-MySQL
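The "php -m" checks described in this section can be scripted. The following is a minimal sketch, not part of the TASUKE package: the function names has_php_module and report_php_modules are hypothetical, and the module list is read from standard input so that the logic can be tried without PHP installed.

```shell
# has_php_module NAME: succeeds if NAME appears as a whole line
# (case-insensitively) in the module list read from standard input.
has_php_module() {
    grep -qix -- "$1"
}

# report_php_modules: reads a module list (the output of "php -m") on
# standard input and prints "ok" or "MISSING" for the modules named in
# this section. Function names are illustrative, not TASUKE commands.
report_php_modules() {
    list=$(cat)
    # Either mysqli or mysqlnd counts as "php-mysql is installed".
    if printf '%s\n' "$list" | has_php_module mysqli ||
       printf '%s\n' "$list" | has_php_module mysqlnd; then
        echo "mysql: ok"
    else
        echo "mysql: MISSING"
    fi
    if printf '%s\n' "$list" | has_php_module json; then
        echo "json: ok"
    else
        echo "json: MISSING"
    fi
}
```

Typical use would be "php -m | report_php_modules". Matching whole lines avoids false positives from modules whose names merely contain the search word, such as pdo_mysql.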
Additionally, when dealing with large datasets (hundreds of accessions or more), additional settings may be required for MySQL and PHP. See here.
If you are currently using legacy TASUKE and you are updating it to TASUKE+, you will need to upgrade your database schema.
This operation is not necessary for a new TASUKE+ installation or if you are already using TASUKE+.
Update your database schema as follows:
$ tasuke_update_for_gwas.pl -db <database name> -u <user> -p <password>
-db <database name> : Database name for TASUKE
-u <user> : User name
-p <password> : Password for the database
Here you create a MySQL database. First, log in to MySQL with root authority and create a database. The "database name" used here is also used in the following installation steps.
$ mysql -u <user> -p
> Enter password: <password>
$ mysql> create database <database name>;
$ mysql> exit;
This tool creates several tables in your database for TASUKE.
$ tasuke_init.pl -db <database name> -u <user> -p <password>
Required:
-db <database name> : Database name for TASUKE
-u <user> : User name
-p <password> : Password for the database
Optional:
-h <remote host> : Remote host name to connect to
-r : Delete the tables from database.
> Input csv file about chromosome's information below.
> Where is the csv file? > <information of reference genome(.csv)>
> Are you sure creating database [y|n] > y
$autoDetectSerialTable = 1;
If you made a mistake in the chromosome or accession list, you must create the MySQL database again.
$ mysql -u <user> -p
> Enter password: <your password>
$ mysql> drop database <database name>;
$ mysql> create database <database name>;
$ mysql> exit;
This tool registers the accessions in the database.
$ tasuke_accession.pl -db <database name> -u <user> -p <password>
Required:
-db <database name> : Database name for TASUKE
-u <user> : User name
-p <password> : Password for the database
Optional:
-h <remote host> : Remote host name to connect to
-r : Delete the accessions from database.
> Input csv file about list of accessions below.
> Where is the csv file? > <accession list(.csv)>
:
> Are you sure adding or updating database?[y|n] > y
$ tasuke_accession.pl -r -db <database name> -u <user> -p <password>
-----------------------------------
Deleting accession from database
-----------------------------------
Input csv file about list of accessions below.
* WARNING : This process deletes not only accession information but also depth and variant data.
This tool registers the reference genome in the database.
$ tasuke_ref.pl -db <database name> -u <user> -p <password> -f <reference genome>
Required:
-db <database name> : Database name for TASUKE
-u <user> : User name
-p <password> : Password for the database
-f <reference genome> : FASTA formatted reference genome file
Optional:
-h <remote host> : Remote host name to connect to
-r : Delete the reference genome from database.
$ tasuke_ref.pl -r -db <database name> -u <user> -p <password>
This tool registers variants in the database.
$ tasuke_variant_vcf.pl -db <database name> -u <user> -p <password> -n <ID> -f <VCF file> -t 'samtools' or 'freebayes' or 'gatk' or 'gatkm'
Required:
-db <database name> : Database name for TASUKE
-u <user> : User name
-p <password> : Password for the database
-n <ID> : Destination ID (accession). For "-t gatkm", a comma-separated list of IDs
-f <variant file> : Variant information (.VCF)
-t 'samtools' or 'gatk' or 'gatkm' : Set this to the name of the program that generated the VCF file.
'gatkm' means multi sample VCF file generated by GATK.
Optional:
-z : For "-t gatkm". Register GT:0/0 variant(not by default)
-h <remote host> : Remote host name to connect to
-r : Delete the variants from database.
Apart from the above, there is a parallel wrapper script that is faster when registering a large multi-sample VCF (tens to thousands of samples). Details are described later.
When you set "gatkm" as the program name (-t), specify a comma-separated list of AccessionIDs for "-n" (no spaces). The order of the AccessionIDs must match the order of the corresponding samples in the VCF file (sample names in the VCF are not used). If there are fewer IDs than samples in the VCF, IDs are mapped starting from the first sample and the excess samples are ignored. To skip samples at the beginning or in the middle of the columns, write only commas for them, like "-n ,,,ID1,ID2,,ID4".
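Because "-n" is matched to samples by position, it helps to see the sample column order before writing the list. The sketch below prints the sample names of an uncompressed VCF as one comma-separated line, relying on the VCF convention that samples start at column 10 of the tab-separated "#CHROM" header line. The function name vcf_sample_list is hypothetical and not a TASUKE command.

```shell
# vcf_sample_list FILE: print the sample names of an uncompressed VCF file
# as a single comma-separated line, in column order. Samples occupy
# column 10 onward of the tab-separated "#CHROM" header line.
# (Illustrative helper; not part of the TASUKE package.)
vcf_sample_list() {
    awk -F '\t' '
        /^#CHROM/ {
            for (i = 10; i <= NF; i++)
                printf "%s%s", $i, (i < NF ? "," : "\n")
            exit    # the header line appears once; stop reading here
        }
    ' "$1"
}
```

Run as "vcf_sample_list <VCF file>". The output shows which position each AccessionID must occupy in the "-n" list; it can be passed to "-n" unchanged only if the VCF sample names happen to be identical to the AccessionIDs.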
A multi-sample VCF file ("-t gatkm") contains GT:0/0 variants, but they are not registered in the DB by default because they increase data size and reduce performance. Add the '-z' option to register GT:0/0 variants. GT:0/0 variants are displayed on the track in GT color mode.
When you want to input a VCF file again, you can delete it with '-r' option.
$ tasuke_variant_vcf.pl -r -db <database name> -u <user> -p <password> -n <ID>
"-k" option only checks the correspondence between VCF samples and DB AccessionIDs. DB registration is not performed. It is strongly recommended to perform this check before DB registration.
$ tasuke_variant_vcf_multi.pl -db <database name> -u <user> -p <password> -f <VCF file>
Either required (*1):
-n <IDs> : Comma-separated AccessionID list (Corresponds to the order of VCF sample names) (*2)
-m <Path> : Path to "SampleName > AccID" correspondence table file. VCFsampleName[,]AccID[\n]...
(none) : Consider VCF sample name as AccessionID.
Other options:
-h <remote host> : (default: localhost)
-z : Register GT:0/0 variant(not by default)
-t <num> : Number of threads(default: 4)
-k : TEST mode. Check AccessionIDs, and output commands for each thread, but do not perform DB registration.
Sub-action:
-r : Delete variant information for the specified AccessionIDs. AccessionID itself is not deleted. A multi-sample VCF previously used for DB registration must be specified with "-f".
(*1) Priority is n > m. Only the AccIDs specified here will be registered.
(*2) See description in tasuke_variant_vcf.pl on this Wiki.
VcfSample2,DbAccId2
VcfSample3,DbAccId3
......
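A correspondence table of this shape is easy to get subtly wrong (a missing comma, a repeated sample name). The sketch below is a simple format check, assuming only what is stated above: one "name,AccID" pair per line. The function name check_mapping is hypothetical, and the check does not look at the database, so it complements the "-k" test mode rather than replacing it.

```shell
# check_mapping FILE: verify that every non-blank line has exactly two
# non-empty comma-separated fields and that no first-column name repeats.
# Prints one message per problem and returns non-zero if any is found.
# (Illustrative helper; not part of the TASUKE package.)
check_mapping() {
    awk -F ',' '
        /^[[:space:]]*$/ { next }                      # skip blank lines
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected NAME,AccID\n", NR
            bad = 1
            next
        }
        seen[$1]++ {                                   # name already used
            printf "line %d: duplicate name %s\n", NR, $1
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Run as "check_mapping <Path>" before passing the same file to "-m".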
This tool sets depth information to database. First you need to create TSV files from your BAMs (see Preparation section).
$ tasuke_tsv_db.pl -db <database name> -u <user> -p <password> -n <ID> -f <depth file>
Required:
-db <database name> : Database name for TASUKE
-u <user> : User name
-p <password> : Password for the database
-n <ID> : Destination ID (accession)
-f <Depth file> : TSV formatted depth information file
Optional:
-h <remote host> : Remote host name to connect to
-r : Delete the depth information from database.
A faster parallel wrapper script is available for registering a large number of samples (tens to hundreds). Details are below.
You can delete registered TSV data with the '-r' option.
$ tasuke_tsv_db.pl -r -db <database name> -u <user> -p <password> -n <ID>
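Since the single-sample tools take one accession per run ("-n <ID>"), registering several accessions amounts to a loop over IDs. The sketch below only prints the commands so they can be reviewed first. It assumes, purely for illustration, that files are laid out as <data dir>/variants/<ID>.vcf and <data dir>/depth/<ID>.tsv and that the VCF files come from GATK ("-t gatk"); the function name is hypothetical.

```shell
# print_registration_commands DB USER DATADIR ID...
# Prints (does not run) one variant and one depth registration command per
# accession ID. The file layout and "-t gatk" are assumptions made for this
# example; the password is left as a placeholder to fill in by hand.
print_registration_commands() {
    db=$1; user=$2; datadir=$3
    shift 3
    for id in "$@"; do
        printf "tasuke_variant_vcf.pl -db %s -u %s -p '<password>' -n %s -f %s/variants/%s.vcf -t gatk\n" \
            "$db" "$user" "$id" "$datadir" "$id"
        printf "tasuke_tsv_db.pl -db %s -u %s -p '<password>' -n %s -f %s/depth/%s.tsv\n" \
            "$db" "$user" "$id" "$datadir" "$id"
    done
}
```

For example, "print_registration_commands mydb me /data acc1 acc2" prints four commands, two per accession.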
"-k" option only checks the correspondence between TSV samples and DB AccessionIDs. DB registration is not performed. It is strongly recommended to perform this check before DB registration.
$ tasuke_tsv_db_multi.pl -db <database> -u <user> -p <password>
-d <DirPath> : Directory path where the TSV files are located. By default, filename(without extensions) is considered as AccessionID.
-m <Path> : Path to "TsvFilePath > AccID" correspondence table. If "-d" is specified, TsvFilePath is its relative path. TsvFilePath[,]AccID[\n]...
Other options:
-h <remote host> : (default: localhost)
-t <num> : Number of threads(default: 4)
-k : TEST mode. Check AccessionIDs, but do not perform DB registration.
Sub-action:
-r : Delete depth information for the specified AccessionIDs. AccessionID itself is not deleted.
TsvFilePath2,DbAccId2
TsvFilePath3,DbAccId3
......
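When only some file names differ from their AccessionIDs, a starting point for this table can be generated from the directory and then edited by hand. The sketch below is one way to do that; the function name make_tsv_mapping is hypothetical, and it assumes the files end in ".tsv" and that the table will be used together with "-d" (so paths are written relative to the directory).

```shell
# make_tsv_mapping DIR: print "TsvFilePath,AccID" lines for every *.tsv
# file in DIR. Paths are relative to DIR, and the AccID column is
# pre-filled with the file name minus ".tsv" for manual correction.
# (Illustrative helper; not part of the TASUKE package.)
make_tsv_mapping() {
    for path in "$1"/*.tsv; do
        [ -e "$path" ] || continue    # no match: the glob stays literal
        file=${path##*/}
        printf '%s,%s\n' "$file" "${file%.tsv}"
    done
}
```

Run as "make_tsv_mapping <DirPath> > mapping.csv", correct the second column where needed, and check the result with the "-k" option before registration.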
This tool loads any kind of TSV formatted NGS data. To load data into the general purpose track, use tasuke_tsv_db.pl with the '-c' option. First you need to create TSV files from your BED or BEDgraph files (see Preparation section).
$ tasuke_tsv_db.pl -c -db <database name> -u <user> -p <password> -n <ID> -f <tsv file>
$ tasuke_tsv_db.pl -r -c -db <database name> -u <user> -p <password> -n <ID>
If you want to set multiple conditions on the general purpose track, run the following command, and then load a TSV file using tasuke_tsv_db.pl.
$ tasuke_add_condition.pl -db <database name> -u <user> -p <password> -n <ID> -f <depth file>
Required:
-db <database name> : Database name for TASUKE
-u <user> : User name
-p <password> : Password for the database
-c <condition_id> : Condition ID(name)
Optional:
-h <remote host> : Remote host name to connect to
-r : Delete the condition and tables from database.
The annotation track on the TASUKE browser can be added from GFF files.
$ tasuke_track_gff.pl -db <database name> -u <user> -p <password> -f <annotation file> -t <track name>
Required:
-db <database name> : Database name for TASUKE
-u <user> : User name
-p <password> : Password for the database
-f <annotation file> : GFF(3) formatted file
-t <track name> : The value set here is used directly as the track name in TASUKE
Optional:
-h <remote host> : Remote host name to connect to
-r : Delete the annotations from database.
$ tasuke_track_gff.pl -r -db <database name> -u <user> -p <password> -t <track name>
Phenotype data can be added to use the GWAS function in TASUKE.
$ tasuke_phenotype.pl -db <database name> -u <user>
Required:
-db <database name> : Database name for TASUKE
-u <user> : User name
Optional:
-h <remote host> : Remote host name to connect to
-r : Delete the phenotype data from database.
<password>
> Where is phenotype data csv file?
(File format: Breed Name,Phenotype,Phenotype Value)
# <phenotype data file(.csv)>
> Where is qqman output file?
# <qqman output file(.txt or .tsv)>
> What is a phenotype of the file?
# <phenotype>
> Completed.
> Read the next file? [y/N]
# y: > Where is qqman output file?
# N: Done.
$phenotypeFlg = "true";
To use the System phylogenetic tree, you need to create a distance matrix and/or a Newick file and set its file path in the config file.
The characteristics of each method are as follows.
- When using a distance matrix
  - This method uses Ajax to perform NJ clustering and create a tree on the server side. Clustering takes a long time for large datasets (>400 samples). It can be sped up by installing PHYLIP.
  - Midpoint rooting is performed.
- When using a Newick
  - This method quickly creates the tree on the browser side without Ajax or NJ clustering. If you exclude Accessions from a Track, simply remove the leaves from the original Newick tree and recreate it.
  - No midpoint rooting is performed.
10-1. Create a distance matrix
There are two ways to create a distance matrix file.
10-1-1. Use tasuke_tree_dmatrix.pl
Use the script included in the TASUKE+ package to create a distance matrix from the TASUKE database contents. Variant information (and Depth if necessary) must be registered in the DB.
This script calculates distances by comparing the presence or absence of variants between accessions.
Command:
$ tasuke_tree_dmatrix.pl -db <database name> -u <user> -p <password> -o <outfile>
Required:
-db <database name> : Database name for TASUKE
-u <user> : User name
-p <password> : Password for the database
-o <outfile> : Distance matrix path
Optional:
-h <remote host> : Remote host name to connect to
-r <order file> : Target accession list file(Generally tasuke_www/conf/order.conf) (Default: all Accessions)
-c <target chrs> : Target chromosomes separated by commas(Default: all Chromosomes)
-m <calc method> : Distance matrix calculation method. simple(default)/jaccard/dice/soergel
-a : Check DEPTH=NULL and if so, set "NA" to that position. It takes a lot of time. If DEPTH info is not registered, this option will have no effect.
-b : (This option is invalid, but left for compatibility.)
-n : [Use with "-a"] Cross all accessions and skip Column where NA exists
-l : Leave binary table file with name "<outfile>.btbl". This file can be reused for distance calculation by converter/makeBinaryDistanceMatrix
-t <tmpdir> : Temporary directory for creating binary table. Large datasets may require tens to hundreds GB.(Default: /tmp)
Running this script usually takes tens of minutes. You can create a more accurate distance matrix by specifying "-a" option, but it can take hours to days.
Creating the binary table, an intermediate product, takes a lot of time, but creating a distance matrix from it takes only a short time. You can specify the "-l" option to keep the binary table file and later resume from the distance matrix creation step.
If you kept the binary table file with the "-l" option, you can recreate the distance matrix with the command below. You can respecify the "-m" and "-n" options. The distance matrix is output to STDOUT.
Command:
$ converter/makeBinaryDistanceMatrix -i <outfile.btbl> [optional] > <outfile2>
Required:
-i <outfile.btbl> : Binary table file(0/1 CSV table)
Optional:
-m <calc method> : Distance matrix calculation method. simple(default)/jaccard/dice/soergel
-n : Crosses all accessions and skips aggregation for position columns with DEPTH=NULL.
10-1-2. Prepare a distance matrix in your own way
You can use a distance matrix created by an external analysis tool (R, etc.). The distance matrix must be in square or lower-triangular format (there is no 10-character limit for sample names). AccessionIDs must be used as sample names, and the matrix must include all Accessions used by TASUKE.
In the next step, set the distance matrix path in 'tasuke_www/conf/config.php'.
Modify tasuke_www/conf/config.php, then select "Reset" from the TASUKE top menu. <outfile> must be placed in a location that the www user has permission to read.
$distanceMatrixPath = "<outfile>";
10-2. Creating a Newick
There are three ways to create a Newick file. Methods 1 and 2 require a "TASUKE environment that is capable of web browsing and has a distance matrix set".
10-2-1. Use TASUKE's "Export Current SystemTree" function
* This method can be performed through a web browser after completing the installation of TASUKE+.
This method is generally recommended.
- Access TASUKE and display SystemTree. (For large datasets with thousands of samples, it may take several tens of minutes to draw the tree)
- Open "Settings > Accession Manager" from the top menu, and set "ID or Name" to "ID" and "Subtitle" to "---(None)" in "Accession title". If the tree nodes are collapsed, press the "Expand all nodes" button.
- Click "Tools > Export > Current SystemTree" from the top menu to download Newick.
- Upload the Newick file to the web server using an sftp client (e.g. WinSCP).
10-2-2. Use getNewick.php
Create a Newick file by directly executing TASUKE's web content (PHP script) on the command line. If you have PHYLIP installed, you can get a midpoint rooted tree.
If creating a SystemTree in a web browser takes a long time and times out, use this method.
$ cd <TASUKE document root>/bin
$ php getNewick.php -d > <outfile>
10-2-3. Prepare Newick in your own way
You can use a Newick file created by an external analysis tool (R, etc.). AccessionIDs must be used as sample names, and the tree must include all Accessions used by TASUKE.
For Newick format details, please see here.
In the next step, set the Newick path in 'tasuke_www/conf/config.php'.
Modify tasuke_www/conf/config.php, then select "Reset" from the TASUKE top menu. <outfile> must be placed in a location that the www user has permission to read.
$newickPath = "<outfile>";
This tool supports the installation of TASUKE. It automatically detects datasets and loads the data into a database. It treats each file name as the ID to register. Before running the tool, confirm the correspondence between file names and accession IDs.
The unified installer does not support GWAS or System phylogenetic tree registration. Also, this tool assumes that multiple single-sample VCF files are used for Variant registration. Register multi-sample VCF files in the DB manually as a separate step.
$ install.sh <TASK> <Option>
TASK (Required):
all : All installation processes
init : Setting default tables to a database
acc : Accession information
ref : Reference sequence
ann : Annotation
var : Variants
tsv : Read depth or General purpose track (default: read depth)
Option:
-h : Help
-r : Delete specified datasets from the database.
-g : Load TSV files to the general purpose track.
Set your server environment in 'install.conf' to run 'install.sh', and place install.conf in the same directory as install.sh.
Modifying install.conf
##### Configuration #####
#Path of 'tasuke_bin'
SCRIPTS='/PATH/tasuke_bin/'
#Database
#mysql or oracle
BACKEND='<mysql or oracle>'
#Database connection
DB=<database name>
USER=<user>
PASS=<password>
#For oracle
TABLESP=<tablespace name>
#Directory for datasets
# 'install.sh' searches for datasets in the following directories and loads them into the database.
# For example, this tool searches for VCF files in './tasuke_sample_data/variants/' when loading variants into the database.
#Datasets
DATADIR='/PATH/tasuke_sample_data/'
#Enter the name of the FASTA file you use as the reference genome, not a directory.
DIR_FASTA='./reference.fasta'
#This script searches for '.gff' files in 'DIR_GFF'.
DIR_GFF='./'
#This script searches for '.vcf' files in 'DIR_VCF'.
DIR_VCF='./variants/'
#This script searches for '.tsv' files in 'DIR_TSV'.
DIR_TSV='./depth/'
#File format of your VCF files ['samtools' or 'gatk']
VCF='gatk'
#########################
In the above case, the tool searches for files under /PATH/tasuke_sample_data/ and loads them into the database.
e.g.) The tool searches for files in '/PATH/tasuke_sample_data/variants/'. If it finds 'human001.vcf', it loads the VCF into the table for human001 in your database.
If browsing TASUKE is extremely slow, such as "It takes several tens of seconds to scroll the track", MySQL table statistics may not have been created.
(This seems to occur when the database has been under heavy load for a long time while registering a large dataset.)
MySQL table statistics are normally created automatically, but you can create them manually with the script below:
$ tasuke_optimize_tables.pl -db <database name> -u <user> -p <password> -a/-v/-d/-c/-s
Either required:
-a : Optimize all tables(Same as specifying "-v -d -c -s")
-v : Optimize variant information tables
-d : Optimize depth information tables
-c : Optimize general purpose track tables
-s : Optimize other system-related tables
Other options:
-h
-k : TEST mode. Check AccessionIDs, and output SQLs, but do not execute it.
If you want to update already installed TASUKE+, please see "How to update".
Installation of web contents is as simple as copying files.
After downloading the TASUKE package, copy the contents of "tasuke_www" into a directory of any name under the Apache document root.
Below is an example command.
$ tar xf ./tasuke-plus.tar
$ mkdir /var/www/html/tasuke
$ cp -r ./tasuke-plus/tasuke_www/* /var/www/html/tasuke/
Starting TASUKE
First, set at least the following configuration.
Modifying conf/config.php
$db = <database name>;
$host = 'localhost' or <hostname>;
$user = <user name>;
$pswd = <password>;
Access the server with a web browser.
If you placed tasuke_www/* in /(Documentroot)/tasuke/, access the following URL.
http://your_domain/tasuke
An HTML5-capable web browser is required. We have checked the operation of TASUKE with Edge, Firefox and Google Chrome on Windows and Mac.
If TASUKE does not work, see this document.
Additional setting (Optional)
Security settings for exposing TASUKE on the internet.
1. Limited-mysql-user for security protection
$ mysql -u <user> -p
> Enter password: <password>
$ mysql> create user '<new user>'@'<hostname>';
$ mysql> set password for '<new user>'@'<hostname>'=password('<new password>');
$ mysql> grant select on <database name>.* to '<new user>'@'<hostname>';
$ mysql> flush privileges;
$ mysql> exit;
$user = <new user>;
$pswd = <new password>;
Modifying /etc/httpd/conf/httpd.conf
•using Apache 2.4 or later
<Directory "<Apache document root>/conf" >
Require all denied
</Directory>
•using Apache 2.2
<Directory "<Apache document root>/conf" >
Order deny,allow
Deny from all
</Directory>
Using database compression reduces the data size and slightly improves performance. In particular, TSV (depth and general-purpose) data size is reduced to 1/2 to 1/6.
TASUKE (the database) does not work until this process is finished.
A compressed database cannot be updated (it is read-only).
If you want to update data after compressing the database, you need to decompress it.
$ service mysqld stop
Move to the database directory
$ cd <mysql database directory> (default: /var/lib/mysql/<database name>)
<tsv table> indicates dx_accession or dx_accession_cstm
Myisampack and myisamchk are repeated for each accession
$ myisampack -v <tsv table>
$ myisamchk -rq --sort-index --analyze <tsv table>.MYI
$ service mysqld start
Load the tables
$ mysql -u <user> -p
> Enter password: <password>
$ mysql> flush tables;
$ mysql> exit;
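Because myisampack and myisamchk are repeated for each accession, the per-table commands can be generated rather than typed. The sketch below only prints the command pairs for review and does not run them. It assumes that the TSV tables are the MyISAM index files whose names start with "dx_" in the database directory; the function name print_pack_commands is hypothetical.

```shell
# print_pack_commands DBDIR: print the myisampack/myisamchk command pair
# for every MyISAM index file in DBDIR whose table name starts with "dx_".
# Nothing is executed; review the output, then run it while mysqld is
# stopped. (Illustrative helper; the "dx_" prefix match is an assumption.)
print_pack_commands() {
    for myi in "$1"/dx_*.MYI; do
        [ -e "$myi" ] || continue    # no match: the glob stays literal
        table=${myi%.MYI}
        printf 'myisampack -v %s\n' "$table"
        printf 'myisamchk -rq --sort-index --analyze %s.MYI\n' "$table"
    done
}
```

For example, "print_pack_commands /var/lib/mysql/<database name>" lists two commands per table, in the same order as the manual steps above.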
$ service mysqld stop
Decompressing the tables
$ myisamchk --unpack <tsv table>
$ service mysqld start
Load the tables
$ mysql -u <user> -p
> Enter password: <password>
$ mysql> flush tables;
$ mysql> exit;
How to update
This section describes how to update TASUKE.
1. Unpack & Copy
After downloading the TASUKE package, copy "tasuke_www" to the Apache document root.
Running the commands below overwrites your configuration files (conf/config.php and order.conf). We recommend backing up your configuration files before updating.
$ tar xf ./tasuke-plus.tar
$ cp -r ./tasuke-plus/tasuke_www/* <TASUKE DIRECTORY>
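Before running the copy above, the two configuration files can be saved with a small helper. This is a minimal sketch, assuming both files live under the conf directory of the TASUKE directory; the function name backup_tasuke_conf and the backup location are hypothetical.

```shell
# backup_tasuke_conf TASUKE_DIR BACKUP_DIR
# Copies conf/config.php and conf/order.conf (when present) from the
# installed TASUKE directory into BACKUP_DIR, creating it if needed.
# (Illustrative helper; not part of the TASUKE package.)
backup_tasuke_conf() {
    mkdir -p "$2" || return 1
    for name in config.php order.conf; do
        if [ -f "$1/conf/$name" ]; then
            # -p keeps the original timestamps and permissions
            cp -p "$1/conf/$name" "$2/$name" || return 1
        fi
    done
}
```

For example, "backup_tasuke_conf <TASUKE DIRECTORY> ~/tasuke_conf_backup". After the update, the saved files can be compared with the new ones when editing the configuration.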
Since the first version of TASUKE+ (20180720), there is no DB schema change, so this operation is unnecessary.
3. Edit the configuration file
Set the necessary items in the updated configuration file.
Alternatively, the configuration items added in each version are documented in the Release note and can be copied and pasted into the previous version's configuration file.
More detail: Configuration-page