Big Data, Machine Learning, & Computational Lexical Semantics

10/05/2020

Gede Primahadi Wijaya Rajeg

This is a post-recorded presentation for the Fakultas Ilmu Budaya Research Talk (FReTalk) at the Faculty of Arts, Udayana University (9 May 2020). The talk summarised my co-authored paper with Karlina Denistia and Simon Musgrave (Rajeg, Denistia & Musgrave 2019a), which focuses on the interaction of lexical semantics and derivational morphology in Indonesian using a vector space model (see Erk 2012). Below is a short summary of the talk:


This paper offers a new perspective on an old question in Indonesian linguistics regarding the semantic similarities and differences between morphologically related words, especially verbal derivations with the meN-, meN-/-kan, and meN-/-i affixes. We present a case study that leverages the availability of big language data (i.e. digital language corpora), quantitative corpus-linguistic methods, and advances in machine-learning techniques in computational linguistics to shed new light on how denominal verbs with these three affixes exhibit distinct or similar semantic distributions. Specifically, we apply a machine-learning technique called the Vector Space Model (VSM), combined with Hierarchical Agglomerative Clustering (HAC) analysis, to capture semantic clusters and (dis)similarity among a set of Indonesian denominal verbs derived with the three affixes. We contextualise the study within the hypotheses that some -kan/-i verb pairs are semantically indistinguishable while others are semantically distinct.
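To make the combination of a vector space with HAC more concrete, here is a minimal sketch in Python. It is not the pipeline used in the paper (see the R Markdown notebook in the references for that). The count vectors, the three unnamed context dimensions, the choice of cosine distance with average linkage, and the helper names cosine_distance, average_linkage and hac are all assumptions made for illustration.

```python
import math

# Toy context-count vectors, INVENTED for illustration (not data from the paper).
# Each position is a count for one hypothetical context dimension.
vectors = {
    "membuahkan": [8.0, 1.0, 0.0],
    "berbuah":    [9.0, 0.0, 1.0],
    "membuahi":   [0.0, 9.0, 0.0],
    "dibuahi":    [1.0, 8.0, 0.0],
}

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def average_linkage(c1, c2, vecs):
    """Mean pairwise cosine distance between the members of two clusters."""
    total = sum(cosine_distance(vecs[a], vecs[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def hac(vecs, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [(word,) for word in vecs]   # start with one cluster per word
    merges = []                             # merge history: (left, right, distance)
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], vecs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges

clusters, merges = hac(vectors, n_clusters=2)
for cluster in clusters:
    print(sorted(cluster))
```

With these invented counts, the two verbs sharing the "fertilise" contexts end up in one cluster and the two "bear fruit" verbs in the other, which is the kind of separation of a -kan/-i pair that the cluster analysis is meant to detect. The merge history in merges is what a dendrogram would visualise.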


Our VSM-based cluster analysis reveals derivational families that cluster together as well as families in which the -kan/-i pairs are separated, reflecting their distinct semantics. We also found verbs with different noun roots and morphologies forming coherent semantic clusters (i.e., motion, communication, and psych verbs). In addition, we demonstrate a technique called nearest neighbours, which reveals, for a given target word, the words closest in meaning to it in the whole corpus. This technique allows us to capture the rich relationships between a word and all other words, and to reveal the semantic domains that separate words based on the same root. For instance, the denominal verb membuahkan 'to bear fruit; to result in' is split from membuahi 'to fertilise sth.', although both are based on the noun root buah 'fruit'. Nearest neighbours show that membuahkan is closest, among others, to verbs in the domain of creation (e.g. intransitive berbuah 'to bear fruit', tercipta 'gets created', menuai 'to harvest; reap'). In contrast, membuahi is closest to its paradigmatic forms in different grammatical voices (i.e. dibuahi [dynamic passive di-] and terbuahi [static passive ter-]) and to other words in the domain of biology, such as zigot 'zygote', ovum 'ovum', and sperma 'sperm'.
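The nearest-neighbours idea can be sketched as follows. This is a simplified, count-based illustration rather than the authors' implementation. The four-sentence "corpus" is invented, and the window size and the helper names cooccurrence_vectors, cosine and nearest_neighbours are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# INVENTED toy corpus of tokenised sentences (not real corpus data).
corpus = [
    ["sperma", "membuahi", "ovum"],
    ["ovum", "dibuahi", "sperma"],
    ["usaha", "membuahkan", "hasil"],
    ["usaha", "berbuah", "hasil"],
]

def cooccurrence_vectors(sentences, window=2):
    """Count, for every word, the words occurring within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for i, target in enumerate(sentence):
            start = max(0, i - window)
            end = min(len(sentence), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[target][sentence[j]] += 1
    return dict(vectors)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def nearest_neighbours(target, vectors, k=3):
    """Return the k words whose vectors are most similar to the target's vector."""
    target_vector = vectors[target]
    scores = [(word, cosine(target_vector, vector))
              for word, vector in vectors.items() if word != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

vectors = cooccurrence_vectors(corpus)
for verb in ("membuahi", "membuahkan"):
    print(verb, [word for word, _ in nearest_neighbours(verb, vectors, k=2)])
```

In this toy setting, membuahi and dibuahi share exactly the same contexts, so each is the other's top neighbour, while membuahkan is closest to berbuah. A real study would use a large corpus and typically weighted or learned vectors instead of raw counts.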


In sum, our investigation provides a usage-based, quantitative account of how forms with these three affixes differ in their semantic distribution.

 



References

Erk, Katrin. 2012. Vector space models of word meaning and phrase meaning: A survey. Language & Linguistics Compass 6(10). 635–653. https://doi.org/10.1002/lnco.362.

Rajeg, Gede Primahadi Wijaya, Karlina Denistia & Simon Musgrave. 2020. Big data, machine learning, & computational lexical semantics. figshare. https://doi.org/10.6084/m9.figshare.12272078.v1. https://figshare.com/articles/Big_data_machine_learning_computational_lexical_semantics/12272078/1 (10 May, 2020).

Rajeg, Gede Primahadi Wijaya, Karlina Denistia & Simon Musgrave. 2019a. Vector Space Models and the usage patterns of Indonesian denominal verbs: A case study of verbs with meN-, meN-/-kan, and meN-/-i affixes. (Ed.) Hiroki Nomoto & David Moeljadi. NUSA (Linguistic Studies Using Large Annotated Corpora) 67. 35–76. http://repository.tufs.ac.jp/handle/10108/94452.

Rajeg, Gede Primahadi Wijaya, Karlina Denistia & Simon Musgrave. 2019b. R Markdown Notebook for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. https://doi.org/10.6084/m9.figshare.9970205. https://figshare.com/articles/R_Markdown_Notebook_for_i_Vector_space_model_and_the_usage_patterns_of_Indonesian_denominal_verbs_i_/9970205.

Rajeg, Gede Primahadi Wijaya, Karlina Denistia & Simon Musgrave. 2019c. Dataset for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. https://doi.org/10.6084/m9.figshare.8187155. https://figshare.com/articles/Dataset_for_i_Vector_space_model_and_the_usage_patterns_of_Indonesian_denominal_verbs_i_/8187155.