- When finding a possible spot, compare it with multiple samples.
- Calculate the mutation rate of your finding, and compare it with data in authoritative publications or databases.
2.2 Procedure for NGS Data Analysis
2.2.1 Quality Control
Analyzing next-generation sequencing (NGS) data is more
complicated, because the results depend on how the DNA library
was constructed and how the adaptors were added. Modern
high-throughput sequencers can generate hundreds of millions of
sequences in a single run, so before analyzing these sequences to
draw biological conclusions, it is advisable to perform some simple
quality control checks to ensure that the raw data look good and
that there are no problems or biases in the data.
Although many sequencers generate a QC report, this is
usually not enough, since it focuses only on problems generated
by the sequencer itself. FastQC is a widely used program that
provides a more detailed QC report and can spot problems that
originate either in the sequencer or in the starting library
material. Using FastQC involves the following steps:
- Use a Linux system and install FastQC
(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
- Type in the command “fastqc [-o output dir] [--(no)extract] [-f
fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN”, where
“output dir” is the output path, the parameter “extract”
determines whether the zipped output is unpacked, the parameter
“-f” specifies the format of the input, and the parameter “-c”
names a file of contaminant sequences to screen against.
- Run FastQC and read the result files:
  - The HTML report shows a summary of the modules that
    were run and a quick evaluation of whether the results of
    each module seem entirely normal (green tick), slightly
    abnormal (orange triangle), or very unusual (red cross).
  - View the per base sequence quality. Quality is expressed as
    the Phred score, “Q = -10 log10(p),” where “p” stands for
    the probability that a base call is wrong. Values of the lower
    quartile and the median should be considered. If the value
    of the lower quartile exceeds 30, the quality can be regarded
    as very good.
  - View the per sequence quality scores. Normally, if 90% of
    the reads have a quality score above 35, the quality can be
    regarded as very good.
  - View the distribution of A, T, G, and C. In most cases, the
    amount of A/T (28%) outweighs that of G/C (22%).
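
The quality checks listed above can be illustrated with a minimal Python sketch. It is not FastQC's own implementation. The function names are hypothetical, the quality strings are assumed to use the Phred+33 encoding, and a read is taken to "have" a quality score equal to the mean of its per-base scores. The toy reads at the end are made up for illustration only.

```python
import math
from collections import Counter


def phred_from_error(p):
    """Convert a base-calling error probability p into a Phred score Q = -10 log10(p)."""
    return -10.0 * math.log10(p)


def decode_qualities(quality_line, offset=33):
    """Decode one FASTQ quality line into Phred scores.

    Assumption: Phred+33 encoding (each character's ASCII code minus 33).
    """
    return [ord(ch) - offset for ch in quality_line]


def fraction_of_reads_above(quality_lines, threshold=35, offset=33):
    """Fraction of reads whose mean Phred score is strictly above the threshold."""
    passing = 0
    for line in quality_lines:
        scores = decode_qualities(line, offset)
        if sum(scores) / len(scores) > threshold:
            passing += 1
    return passing / len(quality_lines)


def base_composition(sequences):
    """Overall fraction of A, C, G, and T across all reads (other symbols are ignored)."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    return {base: counts[base] / total for base in "ACGT"}


# Toy data, invented for this example: two reads and their quality lines.
reads = ["ACGTAATT", "GGCCATAT"]
quals = ["IIIIIIII", "########"]  # 'I' decodes to 40, '#' decodes to 2

print(phred_from_error(0.001))         # one error in 1000 base calls
print(fraction_of_reads_above(quals))  # share of reads with mean quality > 35
print(base_composition(reads))         # A/C/G/T fractions
```

On real data, the same functions would be applied to the sequence and quality lines of a FASTQ file, and the results compared with the thresholds given above (lower quartile above 30, 90% of reads above 35).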
6 Keyi Long et al.