Thesis etd-09282022-172321

Thesis type

Master's thesis

Author

VIGLIANISI, BRANDON WILLY

URN

etd-09282022-172321

Title

Rivisitazioni del token mixer in architetture Transformer

Title in English

Revisiting token mixing in Transformer architectures

Department

Department of Engineering "Enzo Ferrari"

Degree programme

Computer Engineering (D.M.270/04)

Committee

Committee member	Role
BARALDI LORENZO	Primary supervisor

Keywords

Attention
Computer Vision
Neural Network
Token mixer
Transformer

Graduation session start date

2022-10-20

Availability

Restricted access: the author can choose which thesis files to make accessible. Mixed availability (all or some of the thesis files are inaccessible)

Release date

2062-10-20

Summary

The Transformer, an architecture created for NLP (Natural Language Processing) tasks, has slowly become the de facto standard in computer vision, replacing the standard convolutional approach with the attention operator. A common belief is that the attention-based token mixing module contributes most to its results. However, recent work has shown that the attention-based module in Transformers can be replaced by spatial MLPs without the final performance suffering excessively. The starting point of this thesis is the paper "MetaFormer Is Actually What You Need for Vision", which shows that the general architecture of the Transformer, and not the specific module that implements the token mixing operation, is the most important part of the architecture. In this work, we tested different strategies by implementing and evaluating token mixing modules, with the aim of obtaining better performance with fewer parameters than the classical Vision Transformer.

Abstract

The Transformer, an architecture originally designed for NLP (Natural Language Processing) tasks, has gradually become the de facto standard in Computer Vision, replacing the standard convolutional approach with the attention operator. A common belief is that the attention-based token mixer module contributes most to its performance. However, recent works have shown that the attention-based module in Transformers can be replaced by spatial MLPs, and the resulting models still perform quite well. This work starts from the paper "MetaFormer Is Actually What You Need for Vision", which shows that the general architecture of the Transformer, rather than the specific token mixer module, is the most important part of the design. We tested different strategies by implementing and evaluating token mixer modules, with the aim of achieving better performance with fewer parameters than the classical Vision Transformer.
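The idea of a fixed general architecture with an interchangeable token mixer can be illustrated with a minimal sketch. It assumes the usual residual block layout (normalization, token mixer, residual connection, then normalization, per-token MLP, residual connection). All names below (`metaformer_block`, `mean_mixer`, `softmax_attention_mixer`, `channel_mlp`) are hypothetical, the mixers have no learned weights, and they are simplified stand-ins rather than the modules evaluated in the thesis.

```python
import math
from typing import Callable, List

Tokens = List[List[float]]  # one row per token, one column per channel


def layer_norm(x: Tokens, eps: float = 1e-5) -> Tokens:
    """Normalize each token to zero mean and unit variance (no learned scale or shift)."""
    out = []
    for tok in x:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / math.sqrt(var + eps) for v in tok])
    return out


def mean_mixer(x: Tokens) -> Tokens:
    """Parameter-free mixer (assumption): every token becomes the average of all tokens."""
    n = len(x)
    avg = [sum(tok[d] for tok in x) / n for d in range(len(x[0]))]
    return [avg[:] for _ in x]


def softmax_attention_mixer(x: Tokens) -> Tokens:
    """Single-head self-attention with identity projections (simplification: no learned weights)."""
    dim = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in x]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, x)) for d in range(dim)])
    return out


def channel_mlp(x: Tokens) -> Tokens:
    """Stand-in for the per-token MLP: a ReLU applied to each token independently."""
    return [[max(0.0, v) for v in tok] for tok in x]


def add(a: Tokens, b: Tokens) -> Tokens:
    """Element-wise sum, used for the residual connections."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def metaformer_block(x: Tokens, token_mixer: Callable[[Tokens], Tokens]) -> Tokens:
    """General block: only `token_mixer` changes between variants."""
    x = add(x, token_mixer(layer_norm(x)))  # mix information across tokens
    x = add(x, channel_mlp(layer_norm(x)))  # transform each token on its own
    return x


tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 3.0, -2.0, 1.0]]
with_attention = metaformer_block(tokens, softmax_attention_mixer)
with_mean = metaformer_block(tokens, mean_mixer)
```

Both calls go through the same block code and return an output with the same shape as the input; swapping the mixer is the only difference between the two variants.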

File

There is 1 file restricted at the author's request.