|Thesis type||Master's thesis|
|Title||Uso di tecniche di Network Analysis per migliorare la ricerca per keyword.|
|English title||Improving a keyword search system with Network Analysis.|
|Department||Dipartimento di Ingegneria "Enzo Ferrari"|
|Degree programme||Computer Engineering (D.M.270/04)|
|Exam session start date||2015-04-16|
|Availability||Accessible via web (all thesis files are accessible)|
The implemented system aims to improve keyword search through network analysis. In this specific case, it uses scientific publications provided by the DBLP database. The system takes a paper as input, finds similar papers, and ranks them by importance (the higher the centrality values obtained from the network analysis, the more important the paper is considered).
The goal of this thesis is to improve a keyword search system with network analysis. The data come from the DBLP database, which provides information about scientific publications. The system takes a paper as input, finds similar papers, and ranks them by importance (the higher a paper's centrality values from the network analysis, the more important the paper is considered). The first step is to create and analyse the networks of papers. Different types of networks can be created, depending on the data stored in the database. Two networks are created. The first is the co-author network, where nodes represent authors and edges link authors who have written one or more papers together. Both the unweighted and the weighted versions of this network are analysed. The second is the topic network, where nodes represent words that occur in paper titles and edges link words that appear in the same title. Network analysis provides many metrics, but the analysis focuses on a few that measure the centrality of a node: degree, betweenness and PageRank. The main idea is to exploit these centrality measures to rank scientific publications: the more central a paper's authors and title words are in their networks, the more important the paper is considered. Once the networks have been built and analysed, a matrix is generated so that the similarity between two papers can be computed. Its columns are words (the baseline features) and its rows are paper titles. The column words are the top-n words by degree centrality extracted from the networks: the top 1400 words of the undirected unweighted, undirected weighted and directed networks are combined, which gives 1486 words in total.
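The network construction and feature selection described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the three paper records are invented toy data rather than DBLP entries, the helper names (`build_network`, `degree`, `top_n`) are hypothetical, and only degree is computed. Betweenness, PageRank and the directed network are left out.

```python
from collections import Counter
from itertools import combinations

# Toy records (hypothetical, not DBLP data): each paper has a title and authors.
papers = [
    {"title": "network analysis for keyword search", "authors": ["Rossi", "Bianchi"]},
    {"title": "keyword search in graph databases", "authors": ["Rossi", "Bianchi", "Verdi"]},
    {"title": "graph centrality measures", "authors": ["Bianchi", "Verdi", "Neri"]},
]

def build_network(groups):
    """Weighted undirected edges: maps a sorted node pair to its co-occurrence count."""
    edges = Counter()
    for group in groups:
        for u, v in combinations(sorted(set(group)), 2):
            edges[(u, v)] += 1
    return edges

def degree(edges, weighted=False):
    """Degree per node; with weighted=True, sum edge weights instead of counting neighbours."""
    deg = Counter()
    for (u, v), w in edges.items():
        deg[u] += w if weighted else 1
        deg[v] += w if weighted else 1
    return deg

def top_n(deg, n):
    """Top-n nodes by degree; ties are broken alphabetically for reproducibility."""
    ranked = sorted(deg.items(), key=lambda kv: (-kv[1], kv[0]))
    return [node for node, _ in ranked[:n]]

# Co-author network: nodes are authors, edges link authors of the same paper.
coauthor_edges = build_network(p["authors"] for p in papers)
# Topic network: nodes are title words, edges link words in the same title.
topic_edges = build_network(p["title"].split() for p in papers)

author_degree = degree(coauthor_edges)
word_degree = degree(topic_edges)

# Baseline features: the top-n title words by degree (n = 3 for the toy data).
features = top_n(word_degree, 3)
```

An edge weight here counts how many papers two authors share (or how many titles two words share), so the same edge dictionary serves both the unweighted and the weighted analysis.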
Each element of the matrix is 1 if the word appears in the title and 0 otherwise. A technique based on LSH is used to evaluate the similarity between papers. The baseline features of the input paper are compared with the baseline features of the papers in the matrix, and 20 papers are returned (this parameter can be set to other values). These papers are then ranked using the network metrics, in one of two ways: in descending order of their network measures, or in descending order of the similarity between the network measures of the input paper and those of the papers returned by the previous step. The system is evaluated in several ways to assess its efficiency.
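The retrieval and ranking steps above can be illustrated with a short sketch. The abstract does not say which LSH scheme is used, so this example assumes MinHash signatures (the basis of one common LSH family) and compares them exhaustively instead of bucketing them. All function names, the hash construction, the centrality values and the sample titles are hypothetical, and only the first ranking option (descending network measure) is shown.

```python
import random

def binary_row(title, features):
    """One matrix row: 1 if the feature word occurs in the title, 0 otherwise."""
    words = set(title.lower().split())
    return [1 if f in words else 0 for f in features]

def make_hashes(k, seed=0, prime=2_147_483_647):
    """k hash functions (a*x + b) mod prime, seeded so that runs are reproducible."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(k)]
    return [lambda x, a=a, b=b: (a * x + b) % prime for a, b in params]

def minhash(row, hashes):
    """MinHash signature over the column indices where the row holds a 1."""
    active = [i for i, bit in enumerate(row) if bit]
    if not active:
        return None  # a title with no feature words has no signature
    return [min(h(i) for i in active) for h in hashes]

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    if sig_a is None or sig_b is None:
        return 0.0
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def similar_papers(query_title, titles, features, k=20, n_hashes=64):
    """Return the k titles whose signatures best match the query's signature."""
    hashes = make_hashes(n_hashes)
    query_sig = minhash(binary_row(query_title, features), hashes)
    scored = [
        (estimated_similarity(query_sig, minhash(binary_row(t, features), hashes)), t)
        for t in titles
    ]
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [t for _, t in scored[:k]]

def rank_by_centrality(titles, features, centrality):
    """Order papers by the summed centrality of their feature words, highest first."""
    def score(title):
        row = binary_row(title, features)
        return sum(centrality[f] for f, bit in zip(features, row) if bit)
    return sorted(titles, key=score, reverse=True)

# Usage with invented values: retrieve candidates, then rank them.
features = ["keyword", "search", "graph", "centrality"]
centrality = {"keyword": 7, "search": 7, "graph": 6, "centrality": 2}
corpus = [
    "graph centrality measures",
    "keyword search in graph databases",
    "keyword search engines",
]
candidates = similar_papers("keyword search", corpus, features, k=2)
ranking = rank_by_centrality(candidates, features, centrality)
```

A production LSH index would hash bands of each signature into buckets so that the query is compared only with papers in matching buckets. The exhaustive comparison here keeps the example short.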