Journal article

Selecting Representative Data Samples from Corpus Collection with Specific Target Domains

Ni Made Ayu Widiastuti, Ketut Artawa, I Made Rajeg, I Nyoman Udayana, Gede Primahadi Wijaya Rajeg

Volume: 6 Number: 1 Published: June 2024

THE INTERNATIONAL JOURNAL OF SOCIAL SCIENCES WORLD (TIJOSSW)

Abstract

This study aims to show the availability of corpus files containing tokens generated by keyword searches, and to select representative data samples based on the availability of those tokens. The paper reports a case study of corpus-linguistic data collection procedures carried out with careful consideration. The data source is the Indonesian section of the Leipzig Corpora Collection (LCC). The corpus sampling frame was adapted from structured guidelines that define the criteria for selecting balanced and representative samples. Two main procedures were used to obtain the representative data samples: (i) determining the availability of corpus files containing tokens generated by the keyword searches; and (ii) selecting the representative data samples based on the availability of those tokens. The result is ten files covering the ten years 2013–2022, each containing ±1,000 linguistic expressions that serve as the balanced and representative data samples.

Keywords: representative sample, target domain, availability, procedures, corpus
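
The two procedures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the in-memory corpus layout (a dictionary mapping a file label to a list of sentences), the example keyword, and the fixed random seed are all assumptions made for illustration. Only the target of roughly 1,000 expressions per file comes from the abstract.

```python
import random
import re


def find_keyword_hits(files, keywords):
    """Procedure (i): for each corpus file, collect the sentences that
    contain at least one keyword as a whole word (case-insensitive).

    `files` maps a hypothetical file label (e.g. a year) to a list of
    sentences; `keywords` is a list of search terms.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return {
        label: [s for s in sentences if pattern.search(s)]
        for label, sentences in files.items()
    }


def sample_hits(hits, per_file=1000, seed=0):
    """Procedure (ii): draw up to `per_file` hits from each file.

    Files with fewer hits than `per_file` keep all of their hits.
    The seed is fixed only so that the sketch is reproducible.
    """
    rng = random.Random(seed)
    return {
        label: rng.sample(sents, per_file) if len(sents) > per_file else list(sents)
        for label, sents in hits.items()
    }


# Toy data; the sentences and the keyword are made up for illustration.
toy_files = {
    "2013": ["saya marah sekali", "dia senang", "marah itu buruk"],
    "2014": ["tidak ada"],
}
toy_hits = find_keyword_hits(toy_files, ["marah"])
toy_samples = sample_hits(toy_hits, per_file=1)
```

Counting the hits per file before sampling makes it visible which files can supply the target number of expressions, which is the sense in which the sample selection depends on token availability.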