
Commit 41ef3421 authored by Florian Centler

Update README.md

parent 96b07b73
Exp=Define a specific experimental name (e.g. Exp=Experiment_ID)
## Usage: General Overview
CMP consists of three Bash scripts which are all called from the main script `main.sh`:
1. `cleanup.sh` takes care of the clean-up of raw read files. Parameters for Trimmomatic can be adjusted here.
2. `analyse_map.sh` does the pre-mapping in case reference genomes are used. The corresponding FASTA genomes must be stored under `$path/Reference/BaseFasta`.
3. `analyse_denovo.sh` implements assembly, binning, reassembly, and mapping of reads against contigs. Parameters for these steps can be adjusted here.
### Step 1: Preprocessing and clean-up of raw data
1. Create the output folders for all clean-up processes
2. Unzip the renamed original raw read files
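The two set-up steps above can be sketched in Bash; the `$path` location and the dummy read file below are invented for illustration and are not part of the pipeline:

```shell
# Illustrative sketch of the Step 1 set-up; $path and the dummy read are invented.
path=$(mktemp -d)
mkdir -p "$path/OrgFiles" "$path/CleanUp" "$path/PreQualCheck" "$path/PostQualCheck"
# A stand-in for a renamed raw read file (Org_IDx_F.fastq.gz pattern):
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$path/OrgFiles/Org_ID1_F.fastq.gz"
gunzip -k "$path/OrgFiles/Org_ID1_F.fastq.gz"   # unzip but keep the archive
```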
1. Perform Step 1 as described above.
2. Start `main.sh` from your command line.
3. First, decide between manual and automatic mode. **Manual mode:** lets you verify that the data is correctly set up (Points 4 and 5); the user is guided through the process by a series of questions, and answering "no" (n) to any question will exit the script.
**Automatic mode:** You have prepared all settings, folders, references, and original files (Points 4 and 5) beforehand, so the full analysis can run automatically without further user intervention.
4. Store all original raw read files (from Illumina sequencing) inside `$path/OrgFiles` (create this folder beforehand to avoid errors during the pipeline run) and rename the files according to the patterns defined in `meinarrayF` and `meinarrayR`.
5. Ideally, you already have one or more reference genomes for the first data-reduction step. In this case, download the respective genomes from the NCBI databases and save them as FASTA files inside the folder `$path/Reference/BaseFasta` (folder and subfolder should be created beforehand).
If this is not the case, start `main.sh` in manual mode without pre-stored reference genomes and follow the manual mode instructions (Pre-Taxonomical Characterization)!
NOTE: Only take high-abundance species (> 30 matches) as references (e.g. based on the Metaxa2 output) or use information from other methods, such as 16S rRNA or mcrA analysis. A single reference genome is also sufficient!
6. Regardless of manual or automatic mode, at the end of the analysis you will be asked whether intermediate results should be deleted.
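In manual mode, the per-question confirmation described above can be pictured as a small helper like the following; the function name and wording are invented for illustration and do not appear in the pipeline:

```shell
# Hypothetical sketch of the manual-mode prompt; any answer other than "y"
# cancels the run, mirroring the behaviour described above.
confirm() {
    printf '%s [y/n] ' "$1"
    read -r answer
    [ "$answer" = "y" ] || { echo "Pipeline cancelled."; return 1; }
}
```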
## Interpreting Output
### CMP
During the analysis, many intermediate results are generated; they can be used to check the correctness of the individual pipeline steps and as input for further external analyses.
Inside the `$path/` folder, the following sub-folders will be created:
* `$path/OrgFiles` will be created by you or during the first part of the analysis and contains all original raw read files, renamed according to the patterns `Org_IDx_F.fastq(.gz)` and `Org_IDx_R.fastq(.gz)`.
* `$path/Reference` will be created by you or during the first part of the analysis and contains the folder `$path/Reference/BaseFasta`. Your reference genome file(s) in FASTA format should be stored here. Later, some log files are generated here as the base for the index building by Bowtie2.
* `$path/PreQualCheck` contains the output files generated by FastQC on the original raw read files.
* `$path/CleanUp` contains all cleaned and trimmed files. These files are the base for all further steps.
* `$path/CleanUp/Clean_IDx` contains the Trimmomatic output (`1P`/`2P` = forward/reverse paired, `1U`/`2U` = forward/reverse unpaired).
* Additionally, `$path/CleanUp/` contains the following files:
- `CombinedUP_IDx.fastq` (only unpaired clean reads)
- `InterlavedPE_IDx.fastq/fasta` (only paired end clean reads)
- `Total_IDx.fastq` (paired and unpaired clean reads)
- `read Dist_clean_IDx.txt` (read length distribution of cleaned reads per ID)
- `trim_logfile`
* `$path/PostQualCheck` contains the FastQC results obtained for the cleaned reads.
* `$path/PreTax` contains Metaxa2 output files, which can be used to select species for the reference database (important: the `ttt` level 7 output).
* `$path/Bowtie2` contains the output of the pre-mapping step against the user-defined reference database (used to reduce the data amount):
- `$path/Bowtie2/IDX`: all Bowtie2 indexes
- `$path/Bowtie2/IDx`: Bowtie2 output (BAM and SAM files), a logfile containing the mapping statistics, and per-sample `idx_stats` files with one tab-separated line per reference genome/strain (name | length | mapped | unmapped).
- `$path/Bowtie2/Mappingfiles`: Contains only the mapped reads per ID in different formats.
- `$path/Bowtie2/Unmappedfiles`: Contains only the unmapped reads per ID in different formats, plus read length distribution files for all unmapped reads or only the paired-end (interleaved) reads; for the next steps, only the interleaved file is important.
- `$path/Bowtie2/idx_inspect_logfile.txt`: Contains the contents of the Bowtie2 index.
- `$path/Bowtie2/idx_logfile.txt`: Contains log information.
* `$path/DeNovo`: This part covers only the unmapped reads. The main folder combines all output folders of the de-novo analysis step:
- `$path/DeNovo/Assembly`: Contains the assembly output and the statistics of the assembly.
- `$path/DeNovo/Assembly/IDBA`: IDBA-UD output per sample; `contig.fa` is the final file.
- `$path/DeNovo/Assembly/Quast`: Statistical analysis of the assembly, summarized in `report.pdf`.
- `$path/DeNovo/Bin_Contig`: Contains the binning output and its statistics.
- `$path/DeNovo/Assembly/IDx`: MaxBin2 output per sample.
- `$path/DeNovo/Assembly/CheckM`: Quality check of bins per sample (where possible); all important information is saved in `logfile_checkm_IDx` per sample.
- `$path/DeNovo/REAssembly`: Contains only the reassembly outputs and post statistics of reassembly with IDBA-UD.
- `$path/DeNovo/Assembly/IDx_reas`: Reassembly results under `IDx_idba_*` and the final file (all contigs per bin combined into one file) under `RAContigs_perBin`.
- `$path/DeNovo/Assembly/PostRA_QUAST`: Statistical analysis of the reassembly, summarized in `report.pdf`.
* `$path/Annotation` contains all files which are the base input for external annotation with Diamond and MEGAN6, per sample (grouped mapped/unmapped).
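The per-sample `idx_stats` table under `$path/Bowtie2/IDx` (name | length | mapped | unmapped) can be summarized with standard tools. A hypothetical sketch; the file name and counts below are invented:

```shell
# Invented example of an idx_stats-style table (name, length, mapped, unmapped):
printf 'genomeA\t4600000\t1200\t30\ngenomeB\t5100000\t800\t12\n' > idx_stats_ID1.txt
# Total mapped and unmapped reads across all reference genomes:
awk -F'\t' '{m += $3; u += $4} END {print "mapped=" m, "unmapped=" u}' idx_stats_ID1.txt
# → mapped=2000 unmapped=42
```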
### External annotation with Diamond and MEGAN6
To finalize the analysis, the annotation steps must be performed externally, because DIAMOND needs substantial computing power to generate an NCBI-database-based mapping file per sample. Therefore we use the EVE cluster.
The first step for annotation is to generate an `nr.dmnd` file from `nr.gz` (downloaded from NCBI) with DIAMOND. This step must be performed only once; afterwards, the same `nr` file can be reused for every run:
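A minimal sketch of this one-time database build, assuming DIAMOND is installed and `nr.gz` is in the working directory (the exact invocation used on the cluster may differ):

```shell
# One-time step: convert NCBI's nr.gz into DIAMOND's binary database format,
# producing nr.dmnd for all subsequent runs.
diamond makedb --in nr.gz --db nr
```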
After all preparation steps, the `*.rma` file can be loaded into MEGAN6 via Open. See the MEGAN manual to find out which analysis options are available.
### Analyzing the whole community
Depending on the ratio between mapped and unmapped reads, the mapped fraction can be low, because only the first alignment of each read is considered. To reconstruct the whole community in the correct proportions, calculate as follows:
1) Calculate the community per species (or at another taxonomic level) separately for the mapped and the unmapped reads, based on the MEGAN6 taxonomy. E.g.:
mapped_output: unmapped_output:
Under `MCB-MG-Pipeline/bin/SingleScripts` there are three additional scripts:
- R script: combines function and taxonomy of the MEGAN6 output (not finished yet)
## Author
* **Daniela Becker**, UFZ - Helmholtz Centre for Environmental Research, Leipzig, Germany