Utrecht University
Utrecht University, Leiden University
Utrecht University
Utrecht University
Utrecht University
We present MWE finder, which enables a user to search for occurrences of multiword expressions (MWEs) in large Dutch text corpora.
Component words of many MWEs in Dutch can occur in multiple forms, need not be adjacent, and can occur in multiple orders (such MWEs are called “flexible”).
Searching for occurrences of such flexible MWEs is difficult and cannot be done reliably with most search applications.
What is needed is a search engine that takes into account the grammatical configuration of the MWE, and that is what MWE finder does. It is therefore embedded in GrETEL (Version 5), a treebank search application for Dutch that allows users to search treebanks for a particular grammatical structure based on an example sentence supplied by the user, and to analyze the results. MWE finder builds upon that principle: a user can enter an example of an MWE in a specific canonical form or choose one from a list of more than 10k canonical forms for Dutch MWEs. The system then automatically creates queries that match the MWE at three levels of specificity, and presents sentences in which the MWE occurs from treebanks selected by the user. The queries for MWEs can be quite complicated, but they are generated fully automatically from the example MWE, provided it is in the canonical form. We will describe what the canonical form of an MWE is, and how the queries are derived from it.
For example, if the user enters the canonical form "iemand zal de dans ontspringen" and chooses to search the MEDIARGUS treebank (a large treebank with Flemish newspaper text created by Kris Heylen from KU Leuven), the most specific generated query (called the ‘MWE query’) finds 1158 hits in over 103 million sentences. The second query, the ‘near-miss query’, yields a superset of these results and includes cases with unexpected determiners and modifiers. It finds 1271 hits. Finally, the ‘major lemma query’, which yields a superset of the near-miss query results and just searches for the major lemmas occurring in the MWE (in any grammatical configuration), yields 1309 hits. MWE finder offers the possibility to see the results of a query without the results of more specific queries, and in this way makes it easy to find examples with unexpected modifiers or determiners on components of the MWE, as we will show in detail for this and other examples. We submit that MWE finder is an excellent research tool for linguistic and lexicographic research into MWEs.
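The nesting of the three result sets can be illustrated with a minimal sketch. It is not MWE finder's implementation: the function name `exclusive_results` and the integer sentence identifiers are hypothetical, and the sets are simply sized to match the hit counts reported above. The sketch only shows how removing the results of the more specific query isolates the hits that are new at each level.

```python
def exclusive_results(mwe, near_miss, major_lemma):
    """Return, per query, the hits that no more specific query also found."""
    # The description above implies mwe <= near_miss <= major_lemma (subsets).
    if not (mwe <= near_miss <= major_lemma):
        raise ValueError("expected nested result sets")
    return {
        "mwe": set(mwe),
        "near_miss_only": near_miss - mwe,
        "major_lemma_only": major_lemma - near_miss,
    }


# Hypothetical sentence identifiers, sized to match the reported hit counts
# for "iemand zal de dans ontspringen" in MEDIARGUS.
mwe = set(range(1158))
near_miss = set(range(1271))
major_lemma = set(range(1309))

layers = exclusive_results(mwe, near_miss, major_lemma)
counts = {name: len(ids) for name, ids in layers.items()}
print(counts)
# {'mwe': 1158, 'near_miss_only': 113, 'major_lemma_only': 38}
```

Under these assumptions, 113 sentences are matched only by the near-miss query and 38 only by the major lemma query. These are the candidates to inspect for unexpected determiners, modifiers, or grammatical configurations.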