Universiteit van Amsterdam
Universiteit van Amsterdam
Nederlandse Omroep Stichting
As the number of news articles grows, readers risk information overload and need relevant recommendations. Graph Neural Networks for link prediction are an emerging approach to recommendation systems. A recent link prediction method, the SEAL framework, can learn from enclosing subgraphs, node embeddings and node attributes. However, it does not learn from textual data. In this research, we propose a new method for link prediction that uses textual data, and we introduce a new dataset of Dutch news articles.
The dataset we present contains 768,000 Dutch articles together with article metadata such as keywords, subcategories and publication date. It comes from the Dutch Public Broadcasting Organisation (NOS). Alongside the articles, we present a hand-labelled dataset of 300,000 links between articles, where each link marks one article as a recommended article for another. The dataset is valuable because it is large, hand-labelled by news experts, and in Dutch.
In this study, we propose a new method for link prediction that combines the state-of-the-art SEAL framework with Context-Aware Network Embeddings (CANE). CANE learns from the text of surrounding nodes and from the structure of the graph. The use of attributes is optional. The available inputs (subgraphs, embeddings and attributes) serve as features for a graph neural network that predicts the probability that a link exists. We compared the proposed method (SEAL + CANE and SEAL + CANE + attributes) with the baseline methods SEAL (subgraphs only), SEAL + Node2vec and SEAL + Node2vec + attributes.
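To make the subgraph input concrete, the following is a minimal sketch of extracting an enclosing subgraph around a candidate link, assuming the usual h-hop definition (all nodes within h hops of either endpoint, plus the edges among them). The function names, the adjacency-dictionary representation and the toy graph are illustrative assumptions, not the code used in this study.

```python
from collections import deque


def k_hop_neighbors(adj, source, hops):
    """Return every node within `hops` edges of `source`, including `source`."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adj.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


def enclosing_subgraph(adj, x, y, hops=1):
    """Nodes and undirected edges of the h-hop enclosing subgraph of (x, y).

    Assumption: the candidate link (x, y) itself is left out of the edge set,
    so a model cannot simply read the label off the subgraph.
    """
    nodes = k_hop_neighbors(adj, x, hops) | k_hop_neighbors(adj, y, hops)
    edges = set()
    for u in nodes:
        for v in adj.get(u, ()):
            if v in nodes and {u, v} != {x, y}:
                edges.add(tuple(sorted((u, v))))
    return nodes, edges


# Hypothetical toy graph: keys are article ids, values are linked articles.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
    "f": {"g"},
    "g": {"f"},
}

nodes, edges = enclosing_subgraph(toy_graph, "a", "b", hops=1)
```

In a full pipeline, each such subgraph would be combined with node embeddings (for example from CANE or Node2vec) and optional attributes before being passed to the graph neural network.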
We evaluated the proposed link prediction method and the baseline methods on the NOS dataset, using the standard link prediction metrics AUC and recall. To assess how the proposed method behaves as a recommender system for Dutch news articles, we compared it with TF-IDF as a text-only approach, using recall@k. We find that the SEAL + CANE method (AUC = 89.09) discriminates between positive and negative links better than SEAL (AUC = 80.18) or SEAL with Node2vec embeddings (AUC = 81.93), neither of which learns from the text. Compared with the text-only method, however, TF-IDF recommends articles better: its recall@k values were higher for every k from 1 to 50. Nevertheless, the SEAL + CANE method outperformed the baseline methods on link prediction.
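The two evaluation metrics can be sketched as follows. This is a minimal illustration under stated assumptions: AUC is computed as the fraction of (positive, negative) link pairs in which the positive link receives the higher score, with ties counted as one half, and recall@k is the share of an article's hand-labelled recommendations that appear in the top k of a ranked list. The function names and the example scores are hypothetical and do not come from the NOS dataset.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive link outscores a random negative one."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant articles that appear in the first k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical predicted link probabilities for existing and non-existing links.
example_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3])

# Hypothetical ranking for one article; the ranking could come from TF-IDF
# cosine similarity or from predicted link probabilities.
ranking = ["a", "b", "c", "d"]
labelled = {"b", "d", "z"}
example_recall_at_2 = recall_at_k(ranking, labelled, 2)
example_recall_at_4 = recall_at_k(ranking, labelled, 4)
```

The AUC values reported here are on a 0 to 100 scale, which corresponds to multiplying the output of such a function by 100.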
The methods that use attributes gave less promising results; we used a fixed set of attributes taken from the article metadata. In future work, feature selection or feature engineering could be used to find a better set of attributes to learn from, which we expect to improve performance.