Feature Extraction Methods in Dialect Identification

Matthew Sung

Leiden University

Jelena Prokic

Leiden University

Computational approaches to dialectology make it possible to work efficiently with large amounts of dialect data and to identify the main dialect groups objectively, without concentrating on a handful of manually selected features. However, while the aggregate approach succeeds in detecting the most important dialect groups, it gives us little information about the linguistic features characteristic of each group. This kind of information is important for theoretical linguistics as well as for the applied tasks of dialect and speaker identification.

To overcome this problem, several approaches have been suggested in the literature. The first aim of this presentation is to compare two of them: a top-down method based on Fisher's linear discriminant (FLD, Prokic et al. 2012) and a bottom-up approach based on Factor Analysis (FA, Pickl 2016). The FLD method proceeds from groups already detected in the data and identifies the most characteristic features of a candidate group of linguistic varieties. The FA approach proceeds from the features, in our case specific word positions, and searches simultaneously for the features and the groups in the data. We examine the performance of these methods using data from dialect surveys conducted in Germany, comprising 182 dialect locations and more than 140 words per location. The data has been multi-string aligned (Prokic et al. 2009) so that features can be extracted at the phonetic level.

The second aim of the presentation is to propose a new approach that we have developed, based on Pointwise Mutual Information (PMI, Church and Hanks 1990). PMI is an association measure that quantifies the tendency of a pair of categories to co-occur by comparing their co-occurrence probability with the probability of each category on its own. We have compared the performance of this method with that of FLD and FA using three evaluation criteria: Exclusivity (how often a feature value is found only in a given cluster), Representativeness (the proportion of the feature value found within the dialect cluster) and Pool of Variation (the number of distinct feature values in a given feature).
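As an illustration of the quantities just described, the following minimal Python sketch computes PMI between dialect clusters and feature values, together with one possible reading of the three evaluation criteria. It is not the implementation used in this study: the function names, the input format (one cluster and feature-value pair per location) and the exact operationalisation of Exclusivity and Representativeness are assumptions made for the example.

```python
import math
from collections import Counter


def pmi_table(observations):
    """Score every attested (cluster, feature value) pair.

    `observations` is a hypothetical input format: a list of
    (cluster, feature_value) tuples, one per dialect location,
    for a single feature (e.g. one aligned word position).
    """
    n = len(observations)
    pair_counts = Counter(observations)
    cluster_counts = Counter(c for c, _ in observations)
    value_counts = Counter(v for _, v in observations)

    table = {}
    for (cluster, value), k in pair_counts.items():
        p_joint = k / n
        p_cluster = cluster_counts[cluster] / n
        p_value = value_counts[value] / n
        table[(cluster, value)] = {
            # PMI: log ratio of the joint probability to the product
            # of the two marginal probabilities.
            "pmi": math.log2(p_joint / (p_cluster * p_value)),
            # Assumed reading of Exclusivity: share of all occurrences
            # of the value that fall inside this cluster.
            "exclusivity": k / value_counts[value],
            # Assumed reading of Representativeness: share of the
            # cluster's locations that show this value.
            "representativeness": k / cluster_counts[cluster],
        }
    return table


def pool_of_variation(observations):
    """Number of distinct values attested for the feature."""
    return len({v for _, v in observations})
```

On a toy feature where three of four "north" locations have the value "a" and all remaining locations have "o", the pair ("north", "a") receives a positive PMI and an exclusivity of 1, whereas ("north", "o") receives a negative PMI.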

Our results show that, under the PMI approach, the normalized PMI score (Bouma 2009) of a feature value consistently stands in a sub-linear relationship to its exclusivity. This pattern is not found with FA or FLD. Exclusivity is arguably the most important of the three parameters we have evaluated. Dialect features with high exclusivity would help us understand more about dialect boundaries, which have often been discarded because of the gradualness of dialect transitions (cf. Paris (1888: 163): "il n’y a réellement pas de dialectes; il n’y a que des traits linguistiques qui entrent respectivement dans des combinaisons diverses", i.e. "there are really no dialects; there are only linguistic features that enter respectively into various combinations"). In addition, feature values found almost exclusively in one cluster can be used for dialect and speaker identification. In terms of representativeness and pool of variation, none of the tested methods shows a strong effect.
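For reference, the standard definition of normalized PMI can be written as follows. Reading $x$ as a dialect cluster and $y$ as a feature value is an assumption about how the measure is applied here, not a detail stated in the abstract.

```latex
\mathrm{pmi}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)},
\qquad
\mathrm{npmi}(x, y) = \frac{\mathrm{pmi}(x, y)}{-\log p(x, y)}
```

The normalized score ranges from $-1$ (the two never co-occur, in the limit) through $0$ (independence) to $1$ (the two only occur together), which makes scores comparable across feature values of different frequencies.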

CLIN33
The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)
UAntwerpen City Campus: Building R
Rodestraat 14, Antwerp, Belgium
22 September 2023