
Commit 7dc9303b authored by Florian Centler

Update README.md

parent 3aac3bfe
# CMP
A custom-made adaptable analysis pipeline for metagenomics data, including assembly, binning, and annotation.
## Getting Started
CMP makes use of a number of software packages, which need to be available on the system:
* Metaxa2 (and supporting tools)
* MetaPhlAn
* Bowtie2
* IDBA-UD
* ABySS
* MaxBin2
* CheckM
If the usage of other tools is desired, they can easily be replaced.
### Installing
An installation bash script is provided which installs all required tools (except for IDBA-UD, Diamond, and MEGAN) for the standard CMP (see above):
```
install.sh
```
Please check comments in the script for further instructions.
After use of the installation script and/or manual installation of all tools, please complete the following steps before calling the `main.sh` script:
```
export PATH=$PATH:/pathto/pplacer-Linux-v1.1.alpha17
export PATH=$PATH:/usr/bin/blastall
export PATH=$PATH:$pathT/ncbi-blast-2.6.0+/bin
```
Furthermore, paths and names need to be specified in `main.sh`:
```
path=/path/to/your/desired/Output-Folder (e.g. path=/home/user/Exp_ID)
pathT=/path/to/the/preinstalled/Tools-Folder (e.g. pathT=/home/user/Tools)
pathSH=/path/to/the/MCB-MG-Pipeline/bin (e.g. pathSH=/home/user/MCB-MGPipeline/bin)
f=number of samples (default: 3)
NOTE: Before calling the array lists, please rename your original raw read files according to the following pattern:
meinarrayF=(Org_ID1_F Org_ID2_F Org_ID3_F)
meinarrayR=(Org_ID1_R Org_ID2_R Org_ID3_R)
meinarrayNEW=(ID1 ID2 ID3)
Exp=Define a specific experimental name (e.g. Exp=Experiment_ID)
```
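The renaming step for the raw read files can be sketched as a small bash loop (a sketch only: the sequencer file names and ID labels below are invented; the `Org_IDx_F`/`Org_IDx_R` pattern is the one expected by `main.sh`):

```shell
#!/usr/bin/env bash
# Sketch: rename raw read files to the Org_IDx_F / Org_IDx_R pattern.
# The original file names below are invented for illustration.
workdir=$(mktemp -d)
cd "$workdir"

# Simulated original paired-end files.
touch sampleA_R1.fastq sampleA_R2.fastq sampleB_R1.fastq sampleB_R2.fastq

orig=(sampleA sampleB)   # original sample names (hypothetical)
new=(ID1 ID2)            # new IDs, as in meinarrayNEW

for i in "${!orig[@]}"; do
  mv "${orig[$i]}_R1.fastq" "Org_${new[$i]}_F.fastq"
  mv "${orig[$i]}_R2.fastq" "Org_${new[$i]}_R.fastq"
done
ls Org_*.fastq
```

The same loop works for `.fastq.gz` files; only the suffix changes.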
## Usage: General Overview
### Step 1: Preprocessing and clean-up of raw data
1. Create output folders for all clean-up processes
2. Unzip the renamed original raw read files
3. FastQC(1) of the uncleaned raw read files (forward and reverse selected) → PreQualCheck
4. Clean-up and quality scan with Trimmomatic(2) (default is PE) → CleanUp
5. Generate interleaved paired-end files (Velvet(3) "shuffleSequences") → Interleaved_SampleID.fastq (only PE)
6. Create further output files with different content → CombinedUE_SampleID.fastq (only SE), Total_SampleID.fastq (PE+SE)
7. Calculate the read length of the total cleaned file (Total_SampleID.fastq)
8. FastQC of the cleaned reads of all possible cleaned files → PostQualCheck
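The read-length calculation in step 7 can be sketched with awk, assuming standard 4-line FASTQ records; the tiny input file and the output file name are illustrative:

```shell
# Sketch of step 7: read-length distribution of a cleaned FASTQ file.
# Assumes standard 4-line FASTQ records; the reads below are invented.
cat > Total_ID1.fastq <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGTAC
+
IIIIII
EOF

# Line 2 of every 4-line record is the sequence: count lengths, report the mean.
awk 'NR % 4 == 2 { len[length($0)]++; total += length($0); n++ }
     END { for (l in len) print l, len[l]; printf "mean %.1f\n", total/n }' \
    Total_ID1.fastq > readDist_clean_ID1.txt
cat readDist_clean_ID1.txt
```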
### Step 2: Pre-Taxonomical-Characterization
This step is integrated within `main.sh` and provides a first insight into the composition of the microbial community using Metaxa2. It is a useful step for selecting the reference database for Bowtie2, i.e. for selecting the cleaned reads based on the expected organisms (the FASTA files of possible species must be downloaded manually).
### Step 3: Read Mapping
1. It starts with Bowtie2: a) build the Bowtie2 index from the desired reference database (under BaseFasta), b) map with Bowtie2, creating a sam file
2. Transform the sam file to bam with Samtools and create a bam index to determine the alignment per pre-defined species (species strand)
3. Prepare the generated unmapped files for the de-novo assembly step
4. Prepare the generated mapped files directly for the external annotation process
5. Calculate the total read length of the unmapped reads, as the basis for calculating the k-mer size for a possible de-novo assembly with ABySS, or simply as information
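Step 5 can be sketched as follows: compute the average unmapped read length and derive a k-mer size via the rule of thumb used in this pipeline (average read length minus 10 bp). The two dummy read lengths are invented:

```shell
# Sketch of step 5: average read length of the unmapped reads as the basis
# for a k-mer size (rule of thumb in this pipeline: average length - 10 bp).
# The two dummy reads (lengths 60 and 72) are invented for illustration.
{
  printf '@read1\n%s\n+\n%s\n' "$(printf 'A%.0s' $(seq 60))" "$(printf 'I%.0s' $(seq 60))"
  printf '@read2\n%s\n+\n%s\n' "$(printf 'A%.0s' $(seq 72))" "$(printf 'I%.0s' $(seq 72))"
} > unmapped_ID1.fastq

# Average sequence length minus 10 -> suggested k-mer size.
kdef=$(awk 'NR % 4 == 2 { total += length($0); n++ }
            END { printf "%d", total/n - 10 }' unmapped_ID1.fastq)
echo "suggested k-mer size: $kdef"
```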
### Step 4: De novo Assembly and Binning of unmapped (unaligned) reads
1. Generate different output folders for the next steps
2. Start IDBA-UD with the total unmapped read file
3. Perform a statistical analysis of the assembly with QUAST
4. Start binning of the IDBA-UD-based contigs with MaxBin2
5. Check the completeness and contamination rate of the bins with CheckM
6. Reassemble the binned reads via IDBA-UD (integrated in MaxBin2)
7. Create folders as the basis for the external annotation process and copy the mapped files of Bowtie2 as well as the contig file of the reassembly per sample into the respective input folders
8. Map reads back to IDBA-UD-based contigs (experimental step)
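The CheckM quality check in step 5 boils down to a threshold on completeness and contamination per bin. A minimal sketch (the table layout, file names, and thresholds below are assumptions for illustration, not CheckM's exact output format):

```shell
# Sketch: keep bins above 70% completeness and below 10% contamination from a
# CheckM-style summary. The layout and values here are invented for illustration.
cat > logfile_checkm_ID1.txt <<'EOF'
Bin_Id Completeness Contamination
ID1.001 92.4 3.1
ID1.002 45.0 1.2
ID1.003 88.7 15.6
EOF

# Skip the header, apply the thresholds, print the surviving bin IDs.
awk 'NR > 1 && $2 > 70 && $3 < 10 { print $1 }' \
    logfile_checkm_ID1.txt > good_bins_ID1.txt
cat good_bins_ID1.txt
```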
### Step 5: External Annotation with *.sub files (diamond_daa and diamond_daa2rma)
Diamond and MEGAN6 are used here to annotate the taxonomy and potential function of mapped and unmapped files.
## Running CMP
1. Perform Step 1 as described above.
2. Start `main.sh` on your command line
3. First, decide between manual and automatic mode
Manual mode: gives you the possibility to check the completeness of steps 4)-5) → the script pauses at each question and continues after the input of "y ↳" (n results in a script abort)
Automatic mode: you have prepared all settings, folders, references, and original files as in 4)-5) beforehand and can therefore start the full analysis without interruptions.
4. Store all original raw read files (based on Illumina sequencing) inside $path/OrgFiles (this folder should be created beforehand to avoid errors during the pipeline run) and rename the files according to the pattern under meinarrayF and meinarrayR
→ manual mode: during the run you are asked!
5. Ideally, you already know some reference genomes (or one important genome) for the first data-reduction step. If this is the case, please download all relevant genomes from the NCBI databases and save the files as FASTA inside the folder $path/Reference/BaseFasta (folder and subfolder should be created beforehand).
If this is not the case, start `main.sh` in manual mode without pre-stored reference genomes and follow the instructions (manual mode)!
→ During the run you are asked! (Pre-Taxonomical Characterization)
NOTE: Only take high-abundance species (> 30 matches) as reference (e.g. based on the Metaxa output), or use information from other methods such as 16S rRNA or mcrA analyses. A single reference genome can also be used!
→ If you answer a question with no (n), the pipeline will be cancelled!
6. Regardless of manual or automatic mode, at the end you will be asked whether to delete unnecessary intermediate results
→ If you answer yes, some results are compressed and others are deleted. Only the files necessary for further analysis are kept.
### Interpreting Output
#### CMP
During the analysis a lot of intermediate results are generated. This is useful for checking that the analysis ran correctly and for using intermediate results in other external analysis steps.
Inside the folder defined by $path/, the following folders with the content described below will be created:
$path/OrgFiles → created by yourself or during the first part of the analysis; contains all original raw read files renamed after the pattern Org_IDx_F.fastq(.gz) and Org_IDx_R.fastq(.gz)
#############################################################################################################
$path/Reference → created by yourself or during the first part of the analysis; contains the folder $path/Reference/BaseFasta, which holds your reference file(s) in FASTA format. Later, some log files are generated here as the basis for index building by Bowtie2.
#############################################################################################################
$path/PreQualCheck → contains the FastQC output files of the original raw read files.
#############################################################################################################
$path/CleanUp → contains all cleaned and trimmed files. These files are the basis for all further steps.
→ $path/CleanUp/Clean_IDx contains the Trimmomatic output
(1P|2P = forward paired and reverse paired, 1U|2U = forward unpaired and reverse unpaired)
→ Additionally, $path/CleanUp/ contains the following file variants:
- CombinedUP_IDx.fastq (only unpaired clean reads)
- InterlavedPE_IDx.fastq/fasta (only paired-end clean reads)
- Total_IDx.fastq (paired and unpaired clean reads)
- readDist_clean_IDx.txt (read length distribution of the cleaned reads per ID)
- trim_logfile
#############################################################################################################
$path/PostQualCheck → after the CleanUp step, all generated files are checked again with FastQC.
#############################################################################################################
$path/PreTax → contains numerous Metaxa-based output files for analyzing the community, as a basis for choosing the reference database or simply out of interest. (Important: ttt level 7 output)
#############################################################################################################
$path/Bowtie2 → contains all output of the pre-mapping step based on your self-defined reference database (to reduce the data volume):
- $path/Bowtie2/IDX (all Bowtie2 indexes)
- $path/Bowtie2/IDx (Bowtie2 output of bam and sam files + logfile containing the mapping statistics + idx_stats per sample, a table of the genomes/strains that were mapped (Name|length|mapped|unmapped))
- $path/Bowtie2/Mappingfiles (contains only the mapped reads per ID in different formats)
- $path/Bowtie2/Unmappedfiles (contains only the unmapped reads per ID in different formats + read length distribution files for all unmapped reads or only for the paired-end (interleaved) reads)
→ for the next steps only the interleaved file is important
- $path/Bowtie2/idx_inspect_logfile.txt (contains the content of the Bowtie2 index)
- $path/Bowtie2/idx_logfile.txt
#############################################################################################################
$path/DeNovo → this part covers only the unmapped reads. The main folder combines all output folders of the de-novo analysis step:
- $path/DeNovo/Assembly (contains only the assembly output and statistics of the assembly)
- $path/DeNovo/Assembly/IDBA (IDBA_UD output per sample, contig.fa is the final file)
- $path/DeNovo/Assembly/Quast (statistical analysis of the assembly, summarized in report.pdf)
- $path/DeNovo/Bin_Contig (contains only the binning output and statistics)
- $path/DeNovo/Assembly/IDx (MaxBin2 output per sample)
- $path/DeNovo/Assembly/CheckM (quality check of the bins per sample (where possible); all important information is saved in logfile_checkm_IDx per sample)
- $path/DeNovo/REAssembly (contains only the reassembly outputs and post-statistics of the reassembly with IDBA_UD)
- $path/DeNovo/Assembly/IDx_reas (reassembly results under IDx_idba_* and the final file (all contigs per bin combined in one analysis file under RAContigs perBin))
- $path/DeNovo/Assembly/PostRA_QUAST (statistical analysis of the assembly, summary in report.pdf)
#############################################################################################################
$path/Annotation → contains all files that form the input for the external annotation with Diamond and MEGAN6, per sample (grouped into mapped/unmapped)
#### External annotation with Diamond and MEGAN6
To finalize the analysis, the annotation steps must be performed externally, because DIAMOND needs considerable compute power to generate an NCBI-database-based mapping file per sample. We therefore use the EVE cluster.
The first step of the annotation is to generate an nr.dmnd file from nr.gz (downloaded from NCBI) with Diamond. This step must be performed only the first time; afterwards you should use the same nr file for every run:
```
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz
diamond makedb --in /path/to/nr.gz --db /path/to/store/nr
```
After generating the nr.dmnd file, the blastx run of Diamond can be started. For this, use diamond_daa_maske.sub inside the subScript folder of the MCB-MG-Pipeline. Inside the maske.sub, change the /data/…/temp/ path to your desired temp folder (create a temp folder under your desired path). You can use this path for every run, so it only needs to be changed once. After adapting the sub file, use the following commands on the EVE cluster. NOTE: IDx must be replaced with your individual sample IDs (the input is found under $path/Annotation/IDx/mapped_annotation/input and $path/Annotation/IDx/unmapped_annotation/input):
```
qsub -l highmem diamond_daa_maske.sub blastx -q /path-to-input/all_mapped_IDx_total.fasta -d /path-to/nr -a /path-to-outputfolder/all_mapped_IDx_total.daa
qsub -l highmem diamond_daa_maske.sub blastx -q /path-to-input/unm_IDx_RAContig_total.fasta -d /path-to/nr -a /path-to-outputfolder/unm_IDx_RAContig_total.daa
```
Now the *.daa file must be transformed into a *.rma file for use with MEGAN6. There is also a diamond_daa2rma_maske.sub under the subScript folder of the MCB-MG-Pipeline. The path must be adapted in the same way as for the first sub file.
```
qsub -l highmem diamond_daa2rma_maske.sub -i /path-to-outputfolder/unm_IDx_RAContig_total.daa -o /path-to-outputfolder/unm_IDx_RAContig_total.rma -a2t /data/umbsysbio/Dani/DB/prot_acc2tax-May2017.abin -a2eggnog /data/umbsysbio/Dani/DB/acc2eggnog-Oct2016X.abin -fun EGGNOG
qsub -l highmem diamond_daa2rma_maske.sub -i /path-to-outputfolder/all_mapped_IDx_total.daa -lg -o /path-to-outputfolder/all_mapped_IDx_total.rma -a2t /data/umbsysbio/Dani/DB/prot_acc2tax-May2017.abin -a2eggnog /data/umbsysbio/Dani/DB/acc2eggnog-Oct2016X.abin -fun EGGNOG
```
- -lg means long reads (used because the unmapped reads are contigs)
- -pof could be an additional setting for paired-end reads in one file
- -fwa (first word is accession)
After all preparation steps, the *.rma file can be opened in MEGAN6 via Open. See the MEGAN manual for the available analysis options.
#### Analyzing the whole community
Depending on the ratio between mapped and unmapped reads, the mapped fraction may be small, because only the first alignment of the reads is considered. To reconstruct the whole community in its correct proportions, calculate as follows:
1) Calculate the community per species (or another taxonomic level) separately for the mapped and the unmapped reads, based on the MEGAN6 taxonomy. E.g.:

| | mapped_output | unmapped_output |
| --- | --- | --- |
| Species1 | sum 1232 → 93.05% | sum 12032 → 92.90% |
| Species2 | sum 90 → 6.80% | sum 900 → 6.95% |
| SpeciesN | sum 2 → 0.15% | sum 20 → 0.15% |
| Sum | 1324 | 12952 |
2) Now check the "overall alignment rate" [OAR] in the Bowtie2 logfile. E.g.:
OAR: 31% → unmapped: 100% − 31% = 69%
For the species in the mapped file, follow:
(OAR × map_Species%) / 100 = e.g. (31 × 93.05) / 100 = 28.85%
For the species in the unmapped file, follow:
((100 − OAR) × unm_Species%) / 100 = (69 × 92.90) / 100 = 64.10%
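The rescaling above can be scripted, e.g. with awk for the floating-point arithmetic (the values are taken from the example):

```shell
# Sketch: combine mapped and unmapped community percentages via the
# overall alignment rate (OAR). Values are from the example above.
OAR=31          # overall alignment rate in %, from the Bowtie2 logfile
map_pct=93.05   # Species1 share among the mapped reads (MEGAN6)
unm_pct=92.90   # Species1 share among the unmapped reads (MEGAN6)

mapped_share=$(awk -v o="$OAR" -v p="$map_pct" 'BEGIN { printf "%.2f", o * p / 100 }')
unmapped_share=$(awk -v o="$OAR" -v p="$unm_pct" 'BEGIN { printf "%.2f", (100 - o) * p / 100 }')
echo "Species1: ${mapped_share}% of the community from mapped + ${unmapped_share}% from unmapped reads"
```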
#### Additional analysis with ABySS
If you would like to use ABySS as the assembler, it is more complicated, but executable scripts are provided.
Under MCB-MG-Pipeline/bin/SingleScripts there are three additional scripts:
- check_builded_fastq_script: can be used if your FastQ file is invalid (e.g. for mock datasets):
$pathSH/SingleScripts/check_builded_fastq_script < input.fastq > checked_output.fastq
- Pre-KmerCheck.sh: if you would like to use ABySS with optimal k-mer values, you must determine them beforehand. This script generates output for all k-mer values between 51 and 248 (step size 4). Afterwards, a stat_total_IDx file with all values important for the k-mer comparison should be generated under kcheck_unmap/unmap_IDx_kcheck. Ideally, the N50 value should be high and the maximal contig length a good compromise!
→ Before starting the single script, please redefine path, f and meinarrayNew in the header of the script!
- Only_ABYSS.sh: with this script you can run an ABySS-only assembly, binning, and reassembly (also with ABySS). The output is marked with the suffix "abyss" under the Assembly folder.
→ Before starting the single script, please redefine path, pathT, pathSH, f, meinarray*, Exp, and kdef (based on the Pre-KmerCheck or on the rule of thumb: average read length of the unmapped reads per sample − 10 bp) in the header of the script!
- R script: combines function and taxonomy of the MEGAN6 output (not yet ready)
## Author
* **Daniela Becker**, UFZ - Helmholtz Centre for Environmental Research, Leipzig, Germany