Computational Systems Biology Methods and Protocols.7z


  1. When finding a possible spot, compare it with multiple
    samples.

  2. Calculate the mutation rate of your finding, and compare it
    with data in authoritative publications or databases.


2.2 Procedure for NGS Data Analysis


2.2.1 Quality Control


Analyzing next-generation DNA sequencing (NGS) data is more
complicated, because the results depend on how the DNA library
was constructed and how the adaptors were added. Modern
high-throughput sequencers can generate hundreds of millions of
sequences in a single run, so before analyzing these sequences to
draw biological conclusions, we should perform some simple
quality control checks to ensure that the raw data look good and
contain no problems or biases.
Although many sequencers generate a QC report, this is
usually not enough, since it focuses only on problems
generated by the sequencer itself. FastQC is a widely
used tool that provides a more detailed QC report and
can spot problems that originate either in the sequencer
or in the starting library material. Using FastQC involves
the following steps:


  1. On a Linux system, install FastQC
    (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

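  2. Type the command “fastqc [-o output dir] [--(no)extract] [-f
    fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN”.
    Here “output dir” is the output directory, “--(no)extract”
    determines whether the zipped output is unpacked, “-f” gives
    the format of the input, “-c” names a file of contaminant
    sequences to screen against, and the trailing arguments are the
    sequence files to check.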

  3. Run FastQC and read the result files:
    • The HTML report shows a summary of the modules which
    were run and a quick evaluation of whether the results of the
    module seem entirely normal (green tick), slightly abnormal
    (orange triangle), or very unusual (red cross).
    • View the per base sequence quality. Quality is expressed as
    a Phred score, Q = -10 log10(p), where “p” is the
    probability that the base call is wrong. Consider the lower
    quartile and the median at each position. If the lower
    quartile exceeds 30 (an error probability below 0.1%), the
    quality can be regarded as very good.
    • View the per sequence quality scores. Normally, if 90% of
    the reads have a mean quality score above 35, the
    quality can be regarded as very good.
    • View the distribution of A, T, G, and C. In most cases, the
    proportion of A/T (28%) exceeds that of G/C (22%).
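
The command in step 2 can also be assembled from a script, which helps when many samples are processed the same way. The following is a minimal Python sketch, not part of the original protocol. It uses only the options named in step 2; the function name build_fastqc_command, its parameter names, and the sample file name are hypothetical, and it assumes the fastqc executable is on the PATH when the command is actually run.

```python
from typing import List, Optional


def build_fastqc_command(
    seqfiles: List[str],
    output_dir: Optional[str] = None,
    extract: Optional[bool] = None,
    fmt: Optional[str] = None,
    contaminant_file: Optional[str] = None,
) -> List[str]:
    """Assemble a FastQC argument list from the options named in step 2.

    This helper is illustrative only; it does not run FastQC itself.
    """
    if not seqfiles:
        raise ValueError("at least one input file is required")
    if fmt is not None and fmt not in ("fastq", "bam", "sam"):
        raise ValueError(f"unsupported format: {fmt}")

    cmd = ["fastqc"]
    if output_dir is not None:
        cmd += ["-o", output_dir]          # output directory
    if extract is not None:
        # --extract unpacks the zipped results, --noextract leaves them zipped
        cmd.append("--extract" if extract else "--noextract")
    if fmt is not None:
        cmd += ["-f", fmt]                 # input format
    if contaminant_file is not None:
        cmd += ["-c", contaminant_file]    # contaminant list
    return cmd + list(seqfiles)


if __name__ == "__main__":
    # "sample_R1.fastq" and "qc_out" are placeholder names.
    cmd = build_fastqc_command(
        ["sample_R1.fastq"], output_dir="qc_out", extract=False, fmt="fastq"
    )
    print(" ".join(cmd))
    # To execute it: subprocess.run(cmd, check=True), with fastqc installed.
```

Returning a list rather than a single string lets the command be passed to subprocess.run without shell quoting problems. Note that FastQC expects the output directory to exist already.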

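The checks listed in step 3 can be reproduced by hand on a small set of reads, which makes the thresholds easier to interpret. The sketch below is an illustration under stated assumptions, not FastQC's own implementation: it assumes quality strings use the Phred+33 encoding, takes the lower quartile as the median of the lower half of the sorted values (one of several conventions), and uses hypothetical function names and toy reads.

```python
from statistics import median
from typing import Dict, List

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 log10(p) to get the error probability p."""
    return 10 ** (-q / 10)


def decode_quality(qual: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


def lower_quartile(values: List[int]) -> float:
    """Median of the lower half of the sorted values."""
    s = sorted(values)
    half = s[: len(s) // 2] or s  # fall back to all values if only one
    return median(half)


def per_base_summary(quals: List[str]) -> List[Dict[str, float]]:
    """Lower quartile and median of the Phred scores at each read position."""
    scores = [decode_quality(q) for q in quals]
    longest = max(len(s) for s in scores)
    summary = []
    for pos in range(longest):
        column = [s[pos] for s in scores if pos < len(s)]
        summary.append({"q1": lower_quartile(column), "median": median(column)})
    return summary


def fraction_reads_above(quals: List[str], threshold: float) -> float:
    """Fraction of non-empty reads whose mean Phred score exceeds threshold."""
    means = [sum(decode_quality(q)) / len(q) for q in quals if q]
    return sum(m > threshold for m in means) / len(means)


def base_composition(seqs: List[str]) -> Dict[str, float]:
    """Fraction of A, C, G, and T among the called bases (N is ignored)."""
    counts = {base: 0 for base in "ACGT"}
    for seq in seqs:
        for base in seq.upper():
            if base in counts:
                counts[base] += 1
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


if __name__ == "__main__":
    # Toy reads, invented for illustration only.
    seqs = ["AATG", "ATCA", "TTGA", "CATA"]
    quals = ["IIII", "IIF?", "FFII", "II?5"]
    print(phred_to_error_prob(30))
    print(per_base_summary(quals))
    print(fraction_reads_above(quals, 35))
    print(base_composition(seqs))
```

On real data, each function would be fed the sequence and quality lines parsed from a FASTQ file. The values can then be compared with the thresholds given in step 3 (a lower quartile above 30, 90% of reads above 35).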

6 Keyi Long et al.
