Investigating lexical-semantic effects on the red and green word order in Dutch using elastic net regression and generalised additive models

Anthe Sevenants

KU Leuven

Freek Van de Velde

KU Leuven

Dirk Speelman

KU Leuven

Lexical-semantic effects have long played an important role in linguistics. Especially in construction grammar, the border between lexis and syntax is blurred, as lexical material and syntactic constraints appear on the same level in a “lexicon-syntax continuum” (Hoffmann and Trousdale 2013, 1). Some efforts have been made to investigate lexical-semantic effects quantitatively (e.g. Bloem 2021), but as yet there is no methodology which uncovers granular, type-level semantic influences under multifactorial control. The current state-of-the-art technique captures lexical effects using random effects in a mixed model (e.g. Gries 2015), but unfortunately, this technique stretches the random effect structure beyond what it was designed for and is subject to model convergence issues (Van de Velde and Pijpops 2019).

Following in the footsteps of Van de Velde and Pijpops (2019), we propose a next-generation methodology for researching lexical-semantic effects on morphosyntactic alternations: elastic net regression, a regression analysis technique which incorporates so-called ‘regularisation’ into its model fitting procedure. In contrast to ‘normal’ regression techniques like linear or logistic regression, elastic net regression adds a penalty term to the maximum likelihood procedure used to compute the model parameters. In practice, this means that the model coefficients are ‘shrunk’ towards zero, which reduces overfitting and yields more generalisable predictions. It also allows us to fit many more variables than traditional regression permits, even to the extent that we can have a dataset with n observations and p variables, where n < p. This is a powerful property for a semantic analysis with potentially hundreds of lexical items, each entered as its own predictor. Additional variables can still be added to the model, which means the analysis remains under multifactorial control.
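To make the shrinkage idea concrete, the following is a minimal sketch of an elastic-net-penalised logistic regression, fitted by proximal gradient descent in plain Python. It is not the implementation used in this study: the function names, the parameters lam (penalty strength) and alpha (lasso/ridge mix), the coding of green as 1 and red as 0, and the token counts per participle are all assumptions invented for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink towards zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_elastic_net_logistic(X, y, lam, alpha=0.5, step=0.5, n_iter=2000):
    """Minimise mean negative log-likelihood
    + lam * (alpha * |b|_1 + (1 - alpha) / 2 * |b|_2^2) by proximal gradient descent."""
    n, p = len(X), len(X[0])
    intercept, coefs = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Predicted probability minus observed outcome, per token.
        residuals = [
            sigmoid(intercept + sum(c * x for c, x in zip(coefs, row))) - target
            for row, target in zip(X, y)
        ]
        intercept -= step * sum(residuals) / n  # the intercept is not penalised
        for j in range(p):
            grad = sum(r * row[j] for r, row in zip(residuals, X)) / n
            grad += lam * (1.0 - alpha) * coefs[j]  # ridge part of the penalty
            # Gradient step followed by soft-thresholding (lasso part).
            coefs[j] = soft_threshold(coefs[j] - step * grad, step * lam * alpha)
    return intercept, coefs

# Invented toy counts (NOT corpus data): participle -> (green tokens, red tokens).
toy_counts = {"gezegd": (9, 1), "gemaakt": (1, 9), "gedaan": (5, 5), "gezien": (1, 0)}
names = list(toy_counts)
X, y = [], []
for j, (green, red) in enumerate(toy_counts.values()):
    indicator = [1.0 if k == j else 0.0 for k in range(len(names))]
    X += [indicator] * (green + red)       # one indicator predictor per participle
    y += [1.0] * green + [0.0] * red       # assumed coding: 1 = green, 0 = red

_, weak = fit_elastic_net_logistic(X, y, lam=0.01)
_, strong = fit_elastic_net_logistic(X, y, lam=0.1)
for name, w, s in zip(names, weak, strong):
    print(f"{name:8s} weak penalty: {w:+.2f}   strong penalty: {s:+.2f}")
```

In this toy setting, the participle attested only once would receive an unbounded coefficient under unpenalised maximum likelihood. The penalty keeps it finite, and a sufficiently strong penalty sets it to exactly zero, which is why sparse lexical predictors do not derail the fit.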

As a demonstration, we apply elastic net regression to 236,408 tokens of verbal clusters in the red and green word order in Dutch, retrieved from the SoNaR corpus (Oostdijk et al. 2008). The red-green alternation is one of the best-known and best-researched morphosyntactic alternations in Dutch (e.g. De Sutter, Geeraerts, and Speelman 2005; Bloem 2021). By analysing the logit correction of each predictor representing a participle, we can deduce the participles’ semantic preference: 546 coefficients (55%) lean towards the green word order and 452 (45%) towards the red word order. This indicates that semantic effects are indeed at play here.
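The tally of coefficient directions can be sketched as follows. The sign convention (positive for green, negative for red) and the toy coefficient values are assumptions for illustration only; the last lines merely recompute the percentages from the two counts reported above.

```python
def summarise_preferences(coefficients):
    """Count participle coefficients by sign.

    Assumed convention: positive = leans green, negative = leans red,
    exactly zero = no preference retained after shrinkage.
    """
    green = sum(1 for value in coefficients.values() if value > 0)
    red = sum(1 for value in coefficients.values() if value < 0)
    neutral = len(coefficients) - green - red
    return {"green": green, "red": red, "neutral": neutral}

def percentage(count, total):
    """Share of `count` in `total`, rounded to a whole percentage."""
    return round(100 * count / total)

# Invented toy coefficients, not values from the fitted model.
toy = {"gezegd": 0.8, "gemaakt": -0.6, "gedaan": 0.0, "gezien": 0.3}
summary = summarise_preferences(toy)
print(summary)

# Recompute the shares from the counts reported in the text.
reported_green, reported_red = 546, 452
reported_total = reported_green + reported_red
print(percentage(reported_green, reported_total), percentage(reported_red, reported_total))
```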

In order to glean the general semantic motivations behind the choice for either word order, we make use of dimension-reduced snaut (Mandera, Keuleers, and Brysbaert 2017) semantic vectors. With a Generalised Additive Model (Hastie and Tibshirani 1987), we model the red and green semantic preferences on the basis of the distributional coordinates, with the goal of finding semantic areas which clearly lean towards one of the two word orders. Our results show that such semantic areas indeed exist. We investigate the types found in the different red- or green-leaning semantic areas using Rekker, a JavaScript-based interactive tool purpose-made for exploring semantic preferences.
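The idea of locating red- or green-leaning semantic areas can be illustrated with a much simpler stand-in for the Generalised Additive Model: a Gaussian kernel smoother over two-dimensional coordinates. This is a deliberately swapped-in technique, not the model used here. The coordinates, coefficient values, bandwidth and function name below are hypothetical, and positive values are again assumed to mean a green preference.

```python
import math

def smoothed_preference(x, y, points, bandwidth=1.0):
    """Kernel-weighted average of the coefficients near location (x, y).

    `points` is a list of (x, y, coefficient) triples, one per participle.
    Returns None when no participle carries any weight at that location.
    """
    weighted_sum, total_weight = 0.0, 0.0
    for px, py, coefficient in points:
        squared_distance = (x - px) ** 2 + (y - py) ** 2
        weight = math.exp(-squared_distance / (2.0 * bandwidth ** 2))
        weighted_sum += weight * coefficient
        total_weight += weight
    if total_weight == 0.0:
        return None
    return weighted_sum / total_weight

# Invented toy data: two semantic clusters with opposite preferences.
toy_points = [
    (0.0, 0.2, 1.1), (0.3, -0.1, 0.8), (-0.2, 0.1, 0.9),   # assumed green-leaning area
    (5.0, 5.1, -1.0), (4.8, 5.2, -0.7), (5.3, 4.9, -1.2),  # assumed red-leaning area
]

for location in [(0.0, 0.0), (5.0, 5.0)]:
    value = smoothed_preference(*location, toy_points)
    leaning = "green" if value > 0 else "red"
    print(location, f"{value:+.2f}", leaning)
```

Unlike this smoother, a GAM estimates the smooth surface with penalised splines and provides uncertainty estimates, so the sketch only conveys the intuition of averaging preferences over neighbouring points in semantic space.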

CLIN33

The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)

UAntwerpen City Campus: Building R

Rodestraat 14, Antwerp, Belgium

22 September 2023