Thesis etd-09172014-130904

Thesis type

Master's degree thesis

Author

SEVERINI, CHIARA

URN

etd-09172014-130904

Title

Detecting genomic variations using NGS data: methods and biases correction strategies

Title in English

Department

Dipartimento di Scienze Fisiche, Informatiche e Matematiche

Degree programme

MATHEMATICS (D.M. 270/04)

Committee

Committee member	Role
LEONCINI MAURO	First supervisor
MONTANGERO MANUELA	Second supervisor

Keywords

GC-content
NGS-data
RC-biases
Read-Count
Structural-Variants

Start date of the examination session

2014-10-09

Availability

Restricted access: the author may decide which thesis files to make accessible. Mixed availability (this option is chosen to make all or some of the thesis files inaccessible)

Release date

2054-10-09

Analytical summary

Structural variations of the genome are events that modify the sequence of nitrogenous bases in an individual's DNA. Detecting these alterations is of fundamental importance, since they can be associated with disease in the individual or with the evolution of the whole species.
In this work we review the methods available for detecting structural variations from the reads produced by next-generation sequencing (NGS) platforms, focusing in particular on the Read Count (RC) method. The RC method can detect Copy Number Variations (CNVs), that is, structural variations that change the number of nucleotides in the genome (insertions, deletions or duplications of DNA sequences). The strategy consists in partitioning the reference genome into non-overlapping windows and then determining the number of reads aligned in each of them (RC estimation), in order to find the regions where these numbers differ from the expected value. However, although in theory RCs follow a Poisson distribution, in practice several biases make a normalization step necessary before detection. The main sources of bias that alter the distribution are the ambiguous alignment of reads (due to repeated regions in the reference genome and to the short length of NGS reads) and the local GC content (the amount of guanine or cytosine bases in a genomic region). In this thesis we present the existing methods for addressing the statistical challenges of detecting CNVs from NGS reads with the RC method. In particular, we focus on a method developed in 2012 that, based on the GC content of a genomic window, predicts the mean read count for every position in the genome and uses that prediction to estimate the copy number.
We then wrote MATLAB code for this strategy and evaluated it on a synthetic dataset.
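
The Read Count idea described above (non-overlapping windows, per-window counts, comparison with an expected value) can be illustrated with a minimal Python sketch. It is not the thesis code, which was written in MATLAB and is not reproduced here. The function names, the rule of assigning each read to a window by its start position, the use of the median count as the expected value, and the threshold of three Poisson standard deviations are all assumptions made for illustration.

```python
from statistics import median


def window_read_counts(read_starts, genome_length, window_size):
    """Count reads per non-overlapping window.

    Assumption: each read is assigned to the window containing its
    0-based start position; reads outside the genome are ignored.
    """
    n_windows = (genome_length + window_size - 1) // window_size
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < genome_length:
            counts[pos // window_size] += 1
    return counts


def flag_windows(counts, expected, threshold=3.0):
    """Label each window as 'gain', 'loss' or 'normal'.

    Under a Poisson model with mean `expected`, the standard deviation
    is sqrt(expected); a window is flagged when its count lies more
    than `threshold` standard deviations from the mean (hypothetical rule).
    """
    sd = expected ** 0.5
    flags = []
    for c in counts:
        z = (c - expected) / sd
        if z > threshold:
            flags.append("gain")
        elif z < -threshold:
            flags.append("loss")
        else:
            flags.append("normal")
    return flags


# Synthetic start positions: 16 reads in each of the first two windows,
# 40 in the third and 2 in the fourth.
read_starts = (
    list(range(0, 16)) + list(range(100, 116)) + list(range(200, 240)) + [300, 350]
)
counts = window_read_counts(read_starts, genome_length=400, window_size=100)
expected = median(counts)  # assumption: the median stands in for the expected count
flags = flag_windows(counts, expected)
```

With real data the counts would come from aligned reads and would first need the normalization step the summary mentions, since raw counts do not follow the Poisson model closely.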

Abstract

Genomic Structural Variations (SVs) are events that alter the sequence of nitrogenous bases in the DNA of an individual. Detecting such events is crucial because they may be related to disease in the individual or to the evolution of the entire species. In this work we review the available methods for detecting SVs from the reads produced by Next-Generation Sequencing (NGS) platforms, focusing in particular on the Read Count (RC) approach. The RC method can detect the structural variations that change the number of nucleotides in the genome, which are referred to as Copy Number Variations (CNVs) and can be insertions, deletions or duplications of DNA sequences. The main strategy consists in partitioning the reference genome into non-overlapping windows and then determining the number of aligned reads in each of them (RC estimation), in order to find regions in which this number differs from the expected one. However, although in theory RCs follow a Poisson distribution, in practice several biases require normalization before detection. The main sources of bias that affect the distribution are ambiguous read mappability (due to the repetitive areas of the reference genome and the short length of NGS reads) and the local GC content (namely the amount of guanine or cytosine bases in a genomic region). Thus, in this work we study existing methods for addressing the statistical challenges of detecting CNVs from NGS data using RC methods. In particular, we focus on a method, developed in 2012, that predicts the mean read count for any genomic position according to the GC content of a genomic window and uses this prediction to perform the copy number estimation. We wrote MATLAB code for this strategy and evaluated it on a synthetic dataset.
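
To make the GC-content bias concrete, the following Python sketch shows one simple way window counts could be normalized by GC content. It is not the 2012 method studied in the thesis: it swaps in a plain median-per-GC-bin rescaling, and the function names, the number of bins and the rescaling rule are assumptions chosen only for illustration.

```python
from statistics import median


def gc_fraction(seq):
    """Fraction of bases in `seq` that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)


def gc_normalize(counts, gc_values, n_bins=10):
    """Rescale each window count by global_median / median of its GC bin.

    Assumption: windows with similar GC fraction share the same bias,
    so dividing by the bin median removes the GC-dependent trend.
    """
    global_median = median(counts)
    # Assign each window to a GC bin; a fraction of exactly 1.0 goes in the last bin.
    bins = [min(int(g * n_bins), n_bins - 1) for g in gc_values]
    by_bin = {}
    for b, c in zip(bins, counts):
        by_bin.setdefault(b, []).append(c)
    bin_median = {b: median(cs) for b, cs in by_bin.items()}
    normalized = []
    for b, c in zip(bins, counts):
        m = bin_median[b]
        normalized.append(c * global_median / m if m > 0 else 0.0)
    return normalized


# Synthetic example: GC-rich windows attract twice as many reads,
# although no copy number change is present.
counts = [10, 10, 20, 20]
gc_values = [0.20, 0.25, 0.50, 0.55]
normalized = gc_normalize(counts, gc_values)
```

After rescaling, the four synthetic windows have the same normalized count, so the difference between them is attributed to GC content rather than to a copy number change. A real pipeline would need many windows per bin for the bin medians to be meaningful.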

Files

One file is restricted at the author's request.