Representing and processing untokenized linguistic forms in the framework of computational construction grammar

Veronica Juliana Schmalz,

KU Leuven

Paul Van Eecke,

ITEC, imec research group at KU Leuven

Katrien Beuls

VUB, UNamur

Constructionist theories of language argue that meaning patterns can exist beyond the word level (Goldberg 2003). However, computational construction grammars such as FCG (Steels 2017), ECG (Bergen & Chang 2005), and SBCG (Boas & Sag 2012) do not directly model this property, as they rely on pre-tokenized linguistic input. Operating on tokens rather than on smaller units of language leads to several problems: the loss of fine-grained character-level information such as morphological changes, limited vocabulary coverage, difficulty in handling unseen words, and tokenization inconsistencies.

To date, no computational implementation handles character-level forms in construction grammar. Shifting from words to characters nevertheless offers several advantages, including more flexible language models, fine-grained morphosyntactic analysis, the ability to capture language-specific features, and broader vocabulary coverage. Within construction grammar specifically, character-level analysis would make it possible to identify discontinuous patterns in sequences of graphemes or phonemes, which facilitates the learning of pattern-based constructions (Doumen et al. 2023).

We propose a novel approach for representing and processing linguistic form through matching and merging in Fluid Construction Grammar (Beuls & Van Eecke 2023; van Trijp et al. 2022). Our method employs two algorithms: one for string alignment and another for pattern matching. Using the Needleman-Wunsch algorithm, we first align two strings or character lists; the aligned output then serves as input for the pattern matching algorithm. This second algorithm compares the aligned sequences character by character, including spaces and gaps, and introduces a variable for each run of unmatched characters, which extends until the next match is found. For instance, given the initial strings "What is the color of the cube?" and "What is the size of the ball?", the string alignment algorithm produces the aligned character sequences "What is the color of the cube?" and "What is the _size of the ball?", where "_" marks a gap. Applying the pattern matching algorithm, we obtain the generalization "What is the ?x1 of the ?x2?" along with two lists that bind each variable to the mismatched characters of the corresponding initial string: (?x1 color) and (?x2 cube) for the first string, and (?x1 size) and (?x2 ball) for the second string.
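The two steps described above can be illustrated with a minimal Python sketch. It is not the FCG implementation: the function names `needleman_wunsch` and `generalise`, the scoring values (match +1, mismatch -1, gap -1), the tie-breaking order in the traceback, and the use of "_" as gap symbol are all assumptions made for illustration only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1, gap_char="_"):
    """Globally align two strings; scoring values are illustrative assumptions."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right cell, preferring the diagonal move
    # when several moves are equally good (an assumed tie-breaking rule).
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(gap_char)
            i -= 1
        else:
            out_a.append(gap_char); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def generalise(aligned_a, aligned_b, gap_char="_"):
    """Replace each run of unmatched characters by a fresh variable ?x1, ?x2, ..."""
    pattern, bindings_a, bindings_b = [], [], []
    run_a, run_b = [], []

    def close_run():
        # Emit one variable for the current run of mismatches, if any.
        if run_a or run_b:
            var = f"?x{len(bindings_a) + 1}"
            pattern.append(var)
            bindings_a.append((var, "".join(run_a)))
            bindings_b.append((var, "".join(run_b)))
            run_a.clear(); run_b.clear()

    for ca, cb in zip(aligned_a, aligned_b):
        if ca == cb:
            close_run()
            pattern.append(ca)
        else:
            # Gap symbols are dropped from the bindings.
            if ca != gap_char:
                run_a.append(ca)
            if cb != gap_char:
                run_b.append(cb)
    close_run()
    return "".join(pattern), bindings_a, bindings_b


if __name__ == "__main__":
    s1 = "What is the color of the cube?"
    s2 = "What is the size of the ball?"
    aligned_1, aligned_2 = needleman_wunsch(s1, s2)
    print(aligned_1)
    print(aligned_2)
    print(generalise(aligned_1, aligned_2))
```

Under these assumed scores and tie-breaking rule, the sketch reproduces the aligned sequences and the generalization given in the example above. Other scoring schemes may yield different, equally optimal alignments (for instance one that matches the "b" in "cube" and "ball"), and an input that itself contains "_" would require a different gap symbol.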

In conclusion, our methodology can be applied to identify continuous and discontinuous patterns beyond the word level across various data types. It is of particular benefit to computational construction grammar, where it facilitates the learning of constructions.

Bergen, B. K., & Chang, N. (2005). Embodied construction grammar in simulation-based language understanding. Construction Grammars: Cognitive grounding and theoretical extensions. John Benjamins.

Beuls, K., & Van Eecke, P. (2023). Fluid Construction Grammar: State of the art and future outlook. In Proceedings of the First International Workshop on Construction Grammars and NLP (CxGs+NLP, GURT/SyntaxFest 2023) (pp. 41-50).

Boas, H. C., & Sag, I. A. (Eds.). (2012). Sign-based construction grammar. Stanford, CA: CSLI Publications/Center for the Study of Language and Information.

Doumen, J., Beuls, K., & Van Eecke, P. (2023, May). Modelling language acquisition through syntactico-semantic pattern finding. In Findings of the Association for Computational Linguistics: EACL 2023 (pp. 1317-1327).

Goldberg, A. E. (2003). Constructions: A new theoretical approach to language. Trends in Cognitive Sciences, 7(5), 219-224.

Steels, L. (2017). Basics of fluid construction grammar. Constructions and Frames, 9(2), 178-225.

van Trijp, R., Beuls, K., & Van Eecke, P. (2022). The FCG editor: An innovative environment for engineering computational construction grammars. PLOS ONE 17(6): e0269708.

CLIN33
The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)
UAntwerpen City Campus: Building R
Rodestraat 14, Antwerp, Belgium
22 September 2023