Automatic assessment of the quality of scholarly documents is a difficult task with high potential impact. Objective ground truth, i.e., large public labeled training data for document-quality evaluation, is lacking. Focusing on what is feasible in scholarly document quality prediction (SDQP), we look at imperfect yet well-defined prediction tasks that provide a realistic proxy for scholarly document quality. Predicting the number of citations and predicting acceptance/rejection at high-profile venues are two such feasible SDQP substitute tasks that have already been approached with a fair amount of success. Thanks in part to the emergence of powerful language models such as BERT and its successors, as well as their more specialized variants, performance on these tasks has been steadily increasing.
Multimodality, in particular the addition of visual information alongside text, has been shown to improve performance on SDQP tasks. In this research, we combine powerful textual and visual encoders into a full multimodal predictive model. Specifically, we combine a textual model that chunks the full paper text and combines BERT chunk encodings (SChuBERT) with a visual model based on Inception V3; we call the resulting model MultiSChuBERT.
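To make the architecture concrete, the sketch below illustrates one way such a text+visual fusion model could be wired up in PyTorch: a recurrent layer summarizes precomputed BERT chunk embeddings (the SChuBERT-style textual branch), and its output is concatenated with a pooled Inception V3 feature vector before a prediction head. The layer types, dimensions, use of a GRU, and concatenation fusion shown here are illustrative assumptions, not the exact configuration of our model.

```python
import torch
import torch.nn as nn

class MultiSChuBERTSketch(nn.Module):
    """Illustrative text+visual model: a GRU over precomputed BERT chunk
    embeddings (SChuBERT-style textual branch) fused by concatenation with
    an Inception-V3-style pooled visual feature vector."""

    def __init__(self, text_emb_dim=768, visual_dim=2048, hidden_dim=256):
        super().__init__()
        # Textual branch: summarize the sequence of chunk embeddings.
        self.chunk_gru = nn.GRU(text_emb_dim, hidden_dim, batch_first=True)
        # Visual branch: project the pooled visual features.
        self.visual_proj = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        # Fusion by concatenation, followed by a regression head
        # (e.g., predicting log citation counts).
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, chunk_embeddings, visual_features):
        # chunk_embeddings: (batch, num_chunks, text_emb_dim)
        # visual_features:  (batch, visual_dim)
        _, last_hidden = self.chunk_gru(chunk_embeddings)
        text_repr = last_hidden[-1]                     # (batch, hidden_dim)
        visual_repr = self.visual_proj(visual_features) # (batch, hidden_dim)
        fused = torch.cat([text_repr, visual_repr], dim=-1)
        return self.head(fused).squeeze(-1)


# Example usage with random tensors standing in for real features.
model = MultiSChuBERTSketch()
chunks = torch.randn(4, 10, 768)    # 4 papers, 10 chunks each
visual = torch.randn(4, 2048)       # pooled Inception V3 features
print(model(chunks, visual).shape)  # torch.Size([4])
```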
In this work, we contribute to the current state of the art in scholarly document processing in three ways. First, we show that when combining visual and textual embeddings in this model, the method of fusing the two types of embeddings can have a substantial impact on the results. Second, when fine-tuning the visual part of the model, overfitting readily occurs, preventing an optimal result. We show that gradual unfreezing of the weights of the visual sub-model, while jointly training it with the full textual+visual model, has an additional substantial positive effect on the results. Third, we show that the benefit of the visual encoding is retained when it is combined with more recent state-of-the-art embedding models as a replacement for the standard BERT-base embeddings. In our experiments using SciBERT, SciNCL, SPECTER and SPECTER2 embeddings, we show that each of these tailored embeddings provides benefits over the standard BERT-base embeddings, with the SPECTER2 embeddings performing best.
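As an illustration of the gradual-unfreezing idea, the hypothetical helper below keeps the visual sub-model's weights frozen at the start of joint training and unfreezes its top-level layer groups from the top down as training progresses. The function, the attribute name `visual_encoder`, and the epoch schedule are assumptions for the sake of the sketch, not our exact training code.

```python
def gradually_unfreeze(visual_submodel, num_unfrozen_groups):
    """Freeze all but the last `num_unfrozen_groups` top-level layer groups
    of the visual sub-model; earlier (lower) layers stay frozen."""
    layer_groups = list(visual_submodel.children())
    cutoff = len(layer_groups) - num_unfrozen_groups
    for i, group in enumerate(layer_groups):
        for param in group.parameters():
            param.requires_grad = i >= cutoff


# Hypothetical schedule: train the textual+visual model jointly, starting
# with the visual branch fully frozen and unfreezing one additional layer
# group every two epochs.
# for epoch in range(num_epochs):
#     gradually_unfreeze(model.visual_encoder, num_unfrozen_groups=epoch // 2)
#     train_one_epoch(model, train_loader, optimizer)
```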
Using BERT-base embeddings, on the (log) number-of-citations prediction task with our ACL dataset, our MultiSChuBERT-BERT-base (text+visual) model obtains an R2 score of 0.454, compared to 0.432 for the SChuBERT-BERT-base (text-only) model. Comparing MultiSChuBERT-BERT-base with SChuBERT-BERT-base on the PeerRead accept/reject prediction task, we again obtain improvements with the MultiSChuBERT model, with large improvements on the PeerRead CL and LG datasets. Specifically, we obtain MultiSChuBERT vs. SChuBERT accuracies of 93.6% vs. 93.5% (AI), 85.2% vs. 82.4% (CL), and 84.9% vs. 80.3% (LG). The effect of SPECTER2 embeddings is assessed using ACL-filtered, a smaller dataset constructed to prevent possible label leakage favoring SPECTER2. Here, performance increases from 0.267 R2 for the SChuBERT-BERT-base model to 0.319 R2 for SChuBERT-SPECTER2, and further to 0.335 R2 for the MultiSChuBERT-SPECTER2 model. In conclusion, the benefit of multimodality remains, even in combination with the best-performing domain-specifically trained textual encoding.