Thesis etd-11052023-104126

Thesis type

Master's degree thesis

Author

SOLERTE, DAMIANO

URN

etd-11052023-104126

Title

Apache Spark GPU: Elaborazione rapida di dati su GPU attraverso le librerie RAPIDS

Title in English

Apache Spark GPU: Fast data processing on GPUs using RAPIDS libraries

Department

Department of Engineering "Enzo Ferrari"

Degree programme

Computer Engineering (D.M. 270/04)

Committee

Committee member	Role
GUERRA FRANCESCO	First supervisor

Keywords

Apache Spark
Big Data
ETL
GPU
RAPIDS

Exam session start date

2023-12-05

Availability

Restricted access (mixed availability): all or some of the thesis files are not publicly accessible.

Release date

2063-12-05

Analytical summary (original Italian)

Analisi di dati attraverso la tecnologia di Apache Spark, sfruttando la potenza della GPU, tramite le opportune librerie da poco implementate, dimostrando la loro potenza con gli ultimi rilasci e aggiornamenti da parte di Apache e paragonando le prestazioni ottenute tra CPU e GPU.
La mole di dati utilizzata è un dataset pubblico che rappresenta tutte le transazioni avvenute sulla blockchain della criptovaluta Ethereum, analizzando anche alcune informazioni sul meccanismo della blockchain stessa. I dati sono stati scaricati attraverso il sito www.blockchair.com che pubblicamente fornisce i file utilizzati.
Oltre a fare un paragone con quanto fatto finora nella ricerca dell'analisi dei dati, si è cercato di testare e comprendere quali sono i limiti odierni che ha raggiunto Spark.
Nel codice scritto è stato effettuato un processo di ETL per ricavare informazioni statistiche dai dati inerenti le transazioni, cercare di trovare eventuali outlier e analizzare i valori nulli.
Infine si è cercato di sfruttare a pieno le librerie di Spark per generare un grafo diretto che mappasse tutte le transazioni degli utenti della blockchain.
Il tutto è stato testato su una macchina cloud di Amazon Web Services, per cercare anche di poter sfruttare a pieno la potenza computazionale, oltrepassando uno dei colli di bottiglia, riscontrati spesso nella ricerca, inerente alla gestione dei dati e alla scalabilità delle applicazioni basate su Apache Spark.

Abstract

This thesis analyzes data with Apache Spark on the GPU, using the relevant, recently implemented libraries. It demonstrates their capabilities with the latest releases and updates from Apache and compares the performance obtained on CPU and on GPU. The data is a public dataset covering all the transactions that took place on the blockchain of the Ethereum cryptocurrency; some information about the blockchain mechanism itself is also analyzed. The data was downloaded from www.blockchair.com, which makes the files publicly available. In addition to comparing the results with what has been done so far in data analysis research, the work tests and tries to understand the limits that Spark has currently reached. The code carries out an ETL process to derive statistical information from the transaction data, look for possible outliers, and analyze null values. Finally, Spark's libraries were used as fully as possible to generate a directed graph mapping all the transactions of blockchain users. Everything was tested on an Amazon Web Services cloud machine, in part to exploit its full computational power and to overcome one of the bottlenecks often reported in the research, concerning data management and the scalability of applications based on Apache Spark.
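
As a rough illustration of the ETL steps named in the abstract (null-value analysis, outlier search, and a directed transaction graph), the following is a minimal sketch in plain Python. It does not use Spark or RAPIDS and is not the thesis code. The sample rows, the field layout (sender, recipient, value), and the choice of the interquartile-range rule for outliers are all assumptions made for the example; the abstract does not say which method was used.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical sample rows: (sender, recipient, value).
# None marks a missing value. Addresses are made up for illustration.
transactions = [
    ("0xa", "0xb", 1.0),
    ("0xa", "0xc", 2.0),
    ("0xb", "0xc", None),
    ("0xc", "0xa", 3.0),
    ("0xb", "0xa", 2.5),
    ("0xa", "0xb", 1.5),
    ("0xc", "0xb", 2.2),
    ("0xb", "0xc", 1.8),
    ("0xc", "0xa", 500.0),
]


def count_nulls(rows):
    """Count rows whose value field is missing."""
    return sum(1 for _, _, value in rows if value is None)


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def build_directed_graph(rows):
    """Map each sender to the set of recipients it sent a transaction to."""
    graph = defaultdict(set)
    for sender, recipient, _ in rows:
        graph[sender].add(recipient)
    return dict(graph)


# Outlier detection only considers rows that actually have a value.
values = [v for _, _, v in transactions if v is not None]

null_count = count_nulls(transactions)
outliers = iqr_outliers(values)
graph = build_directed_graph(transactions)

print("null values:", null_count)
print("outliers:", outliers)
print("edges from 0xa:", sorted(graph["0xa"]))
```

In a Spark setting the same three steps would typically be expressed as DataFrame aggregations and an edge list, which is where a CPU versus GPU comparison becomes meaningful; the sketch only shows the logic on a handful of rows.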

Files

1 file is restricted at the author's request.