Courses, Tutorials and Slides – lecture_read

Sequence data format (fastq)

Illumina sequencing:

Millions of sequences (= reads) are generated
single-read sequencing = one file (R1)
paired-read sequencing = two files (R1/R2)

MinION:

reads are not paired!
But saved in the same way!

Identifier (run, cluster, …)
Sequence (A, T, G, C)
“+” separator
Quality (ASCII)

For row format (fastq)

Row	Description
1st row	Identifier (run, cluster, location on flow cell, read number, etc.)
2nd row	Nucleotide sequence (A, T, G, C, N)
3rd row	Separator line (optionally repeats carries an identifier)
4th row	Quality scores for each base (ASCII-encoded Phred scores)

Read format

Example: NCBI SRA GenBank entry: SRR5560266 (single read data)

The first row provides information about the cluster, read position, location, run, … https://help.basespace.illumina.com/articles/descriptive/fastq-files/
Nucleotide sequence
Spacer
Base call quality for each nucleotide

Read format (paired reads)

Paired reads are saved in the same format as single read data
Usually paired reads are stored in two separate files, each file with the same number and order of reads
Paired reads have (almost) the same identifier

Read quality (Q) and error probability (P)

quality = probability that a base is detected falsely
each base gets its own quality score = Q score
Q = Phred quality score
P = base-calling error probability

Relationship between error probability (P) and Phred quality score (Q)

Here is the R code in case you want to create this graph by yourself in R…

library(ggplot2)

P  <- 10^seq(-4, 0, length.out = 1000)
df <- data.frame(P = P, Q = -10 * log10(P))

Qvals  <- c(10, 20, 30)
guides <- data.frame(
  Q   = Qvals,
  P   = 10^(-Qvals/10),
  col = c("red", "forestgreen", "blue"),
  lty = c("dashed", "dotted", "dotdash")
)

p <- ggplot(df, aes(x = P, y = Q)) +
  geom_line(size = 1) +
  geom_segment(data = guides,
               aes(x = 1e-4, xend = P, y = Q, yend = Q,
                   colour = col, linetype = lty),
               inherit.aes = FALSE, size = 1) +
  geom_segment(data = guides,
               aes(x = P, xend = P, y = 0, yend = Q,
                   colour = col, linetype = lty),
               inherit.aes = FALSE, size = 1) +
  geom_point(data = guides,
             aes(x = P, y = Q, colour = col),
             inherit.aes = FALSE, size = 3) +
  scale_x_log10(
    limits = c(1e-4, 1),
    breaks = c(1e-4, 1e-3, 1e-2, 1e-1, 1),
    labels = c("0.0001", "0.001", "0.01", "0.1", "1")
  ) +
  scale_y_continuous(
    limits = c(0, 45),
    breaks = seq(0, 45, by = 5),
    expand = expansion(mult = c(0, 0))
  ) +
  scale_colour_identity() +
  scale_linetype_identity() +
  labs(x = expression(P == 10^{-Q/10}), 
       y = expression(Q == -10 %.% log[10](P))) +
  theme_minimal(base_size = 14) +
  theme(
    panel.border   = element_rect(color = "black", fill = NA),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.ticks = element_line(color = "black", size = 0.5),
    axis.ticks.length = unit(0.2, "cm")
  )

ggsave("q_vs_p_plot.png",
       plot = p,
       width = 1200*f, height = 1200*f, units = "px", dpi = 300, bg="white")

Read quality: Illumina vs Nanopore data

Phred quality scores

Quality score (Q)	Probability that a base is called incorrectly	Accuracy	Error probability
10	1 in 10	90%	10%
20	1 in 100	99%	1%
30	1 in 1000	99.9%	0.1%
40	1 in 10000	99.99%	0.01%

Read quality (Q) is encoded in ASCII format

Every base in FASTQ has a quality score
FASTQ does not store numbers but ASCII characters
To geht the character: calculate Q + 33 –> position in ASCII table
Example: Q = 4 –> 37 –> %

Q	P_error	ASCII	Q	P_error	ASCII	Q	P_error	ASCII	Q	P_error	ASCII
0	1.00000	!	11	0.07943	,	22	0.00631	7	33	0.00050	B
1	0.79433	”	12	0.06310	-	23	0.00501	8	34	0.00040	C
2	0.63096	#	13	0.05012	.	24	0.00398	9	35	0.00032	D
3	0.50119	$	14	0.03981	/	25	0.00316	:	36	0.00025	E
4	0.39811	%	15	0.03162	0	26	0.00251	;	37	0.00020	F
5	0.31623	&	16	0.02512	1	27	0.00200	<	38	0.00016	G
6	0.25119	’	17	0.01995	2	28	0.00158	=	39	0.00013	H
7	0.19953	(	18	0.01585	3	29	0.00126	>	40	0.00010	I
8	0.15849	)	19	0.01259	4	30	0.00100	?	41	0.00008	J
9	0.12589	*	20	0.01000	5	31	0.00079	@	42	0.00006	K
10	0.10000	+	21	0.00794	6	32	0.00063	A

Read quality (Q) is encoded in ASCII format

Q	P_error	ASCII	Q	P_error	ASCII	Q	P_error	ASCII	Q	P_error	ASCII
0	1.00000	!	11	0.07943	,	22	0.00631	7	33	0.00050	B
1	0.79433	”	12	0.06310	-	23	0.00501	8	34	0.00040	C
2	0.63096	#	13	0.05012	.	24	0.00398	9	35	0.00032	D
3	0.50119	$	14	0.03981	/	25	0.00316	:	36	0.00025	E
4	0.39811	%	15	0.03162	0	26	0.00251	;	37	0.00020	F
5	0.31623	&	16	0.02512	1	27	0.00200	<	38	0.00016	G
6	0.25119	’	17	0.01995	2	28	0.00158	=	39	0.00013	H
7	0.19953	(	18	0.01585	3	29	0.00126	>	40	0.00010	I
8	0.15849	)	19	0.01259	4	30	0.00100	?	41	0.00008	J
9	0.12589	*	20	0.01000	5	31	0.00079	@	42	0.00006	K
10	0.10000	+	21	0.00794	6	32	0.00063	A

Read quality (Q) is encoded in ASCII format

Q	P_error	ASCII	Q	P_error	ASCII	Q	P_error	ASCII	Q	P_error	ASCII
0	1.00000	!	11	0.07943	,	22	0.00631	7	33	0.00050	B
1	0.79433	”	12	0.06310	-	23	0.00501	8	34	0.00040	C
2	0.63096	#	13	0.05012	.	24	0.00398	9	35	0.00032	D
3	0.50119	$	14	0.03981	/	25	0.00316	:	36	0.00025	E
4	0.39811	%	15	0.03162	0	26	0.00251	;	37	0.00020	F
5	0.31623	&	16	0.02512	1	27	0.00200	<	38	0.00016	G
6	0.25119	’	17	0.01995	2	28	0.00158	=	39	0.00013	H
7	0.19953	(	18	0.01585	3	29	0.00126	>	40	0.00010	I
8	0.15849	)	19	0.01259	4	30	0.00100	?	41	0.00008	J
9	0.12589	*	20	0.01000	5	31	0.00079	@	42	0.00006	K

FASTQ encoding: ASCI <-> Q score <-> read quality

Take-home message

Reads are stored in FASTQ format
One read has four lines
- Identifier
- Nucleotide sequence
- Spacer
- Nucleotide quality
Q score = measure of base call accuracy
Q is stored as ASCII characters (Phred+33)
Higher Q –> lower error probability
Quality profiles help to detect sequencing problems

reads - Storing Sequence Information

Sequence data format (fastq)

For row format (fastq)

Read format

Read format (paired reads)

Read quality (Q) and error probability (P)

Relationship between error probability (P) and Phred quality score (Q)

Read quality: Illumina vs Nanopore data

Phred quality scores

Read quality (Q) is encoded in ASCII format

Read quality (Q) is encoded in ASCII format

Read quality (Q) is encoded in ASCII format

FASTQ encoding: ASCI <-> Q score <-> read quality

Take-home message