Comparative Analysis of the Term Frequency-Inverse Document Frequency (TF-IDF) and Term Frequency-Inverse Document Frequency with Adaptive Word Position (TF-IDF-AP) Methods in Document Search

30/06/2020

I Putu Gede Hendra Suputra

1. Introduction

Easy access to information through the internet has made the available information ever larger and more diverse. As a result, internet users need search tools that return the most relevant documents effectively and efficiently. An information retrieval (IR) system aims to return the most relevant documents based on the keywords in the user's query. A document is considered relevant if it matches what the user is searching for, but the terms contained in documents and queries often have many variations.

Basic weighting is done by calculating the frequency of occurrence of terms in documents, because that frequency is believed to indicate the extent to which a term represents the contents of a document [1]. Word weighting is expected to find the most relevant information by producing the best index of terms. Weighting words with the combined TF-IDF method gives more weight to more important terms, so important information can be obtained from a document based on the words it contains. Adding terms to the query, known as query expansion, is also needed to improve performance in information retrieval (IR) [3].

Information search takes many forms, one of which is searching for information in journals. In journals, the position of words or sentences is very important in the search process.

The TF-IDF-AP algorithm can dynamically determine position weights according to word positions. A previous study introduced TF-IDF-AP into a vector space model and designed comparative experiments between TF-IDF and TF-IDF-AP for grouping Chinese documents; the results showed that TF-IDF-AP improved search results by 12.9% [3]. In this study, a search is conducted on Indonesian-language journal articles collected from JELIKU (Journal of Computer Science), computing TF-IDF-AP values using the first position, the last position, and the adaptive position (a combination of the first and last positions) of a word.

The purpose of this research is to compare information search results on JELIKU (Journal of Computer Science) obtained with the TF-IDF and TF-IDF-AP algorithms, using several queries or keywords to retrieve documents relevant to each query.

2. Research Methods

2.1 TF-IDF

TF-IDF (Term Frequency - Inverse Document Frequency) is an algorithm that can be used to analyze the relationship between a phrase or sentence and a set of documents. The TF value is calculated as TF = number of occurrences of the selected word in the document / number of words in the document, and the IDF value is calculated as IDF = log(number of documents / number of documents containing the selected word). The TF and IDF values are then multiplied to obtain the final weight.
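The two formulas above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in the study: the function names, the toy corpus, the use of the natural logarithm, and the zero weight for terms absent from the corpus are all assumptions made here.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """TF = occurrences of the term in the document / number of words in the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """IDF = log(number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term absent from the whole corpus gets zero weight
    return math.log(len(corpus) / containing)  # natural log is an assumption


def tf_idf(term, doc_tokens, corpus):
    """Final weight: TF multiplied by IDF."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["sistem", "temu", "kembali", "informasi"],
    ["pembobotan", "kata", "informasi", "dokumen"],
    ["posisi", "kata", "dokumen", "jurnal"],
]
weight_common = tf_idf("informasi", corpus[0], corpus)  # term appears in 2 of 3 documents
weight_rare = tf_idf("jurnal", corpus[2], corpus)       # term appears in 1 of 3 documents
```

In this toy corpus the rarer term receives the larger weight, which is the behaviour the IDF factor is meant to produce.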

2.2 TF-IDF-AP

TF-IDF-AP is a development of TF-IDF in which the TF-IDF value is multiplied by the position weight of the word. The TF-IDF-AP formula is as follows:

TFIDFAP = TFIDF × AP
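The text does not give the formula for the position weight AP, so the sketch below uses purely hypothetical weights to show how the multiplication could work: a first-position weight that decays linearly from the start of the document, a last-position weight that grows toward the end, and an adaptive weight taken as their mean. None of these definitions come from the study.

```python
def first_position_weight(term, tokens):
    """Hypothetical: 1.0 when the term first occurs at the start, decaying linearly."""
    if term not in tokens:
        return 0.0
    return 1.0 - tokens.index(term) / len(tokens)


def last_position_weight(term, tokens):
    """Hypothetical: approaches 1.0 as the term's last occurrence nears the end."""
    if term not in tokens:
        return 0.0
    last_index = len(tokens) - 1 - tokens[::-1].index(term)
    return (last_index + 1) / len(tokens)


def adaptive_position_weight(term, tokens):
    """Hypothetical combination: mean of the first- and last-position weights."""
    first = first_position_weight(term, tokens)
    last = last_position_weight(term, tokens)
    return (first + last) / 2


def tf_idf_ap(tf_idf_value, ap):
    """TFIDFAP = TFIDF x AP, as in the formula above."""
    return tf_idf_value * ap


# Hypothetical tokenized document.
tokens = ["metode", "tfidf", "dokumen", "metode"]
ap_edge = adaptive_position_weight("metode", tokens)   # occurs at both ends
ap_inner = adaptive_position_weight("tfidf", tokens)   # occurs once, near the start
weighted = tf_idf_ap(0.2, ap_inner)                    # 0.2 is an arbitrary TF-IDF value
```

Under these assumed weights, a word that appears at both the beginning and the end of a document keeps its full TF-IDF value, while a word found only in between is scaled down.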

Research steps:

1. Prepare the documents
2. Perform preprocessing (case folding, stopword removal, stemming, tokenizing)
3. Apply TF-IDF and TF-IDF-AP weighting
4. Calculate cosine similarity
5. Compare the precision, accuracy, recall, and F-measure obtained with TF-IDF and with TF-IDF-AP
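Steps 2 and 4 can be illustrated with a short Python sketch. It rests on several assumptions that are not from the study: the stopword list is a tiny hypothetical sample, stemming is left as a pluggable function that does nothing by default (an Indonesian stemmer would be plugged in here), tokenizing is done before stopword removal for convenience, and the vectors hold raw term counts where the study would use TF-IDF or TF-IDF-AP weights.

```python
import math
import re
from collections import Counter

# Tiny hypothetical stopword list; a real list would be much longer.
STOPWORDS = {"yang", "dan", "di", "dari", "untuk"}


def preprocess(text, stem=lambda word: word):
    """Case folding, tokenizing, stopword removal, then stemming (identity by default)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOPWORDS]


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors (dict-like)."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in vec_a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and document text.
query_tokens = preprocess("Pencarian Dokumen")
doc_tokens = preprocess("Pencarian informasi dari dokumen jurnal")

# Raw counts stand in for the TF-IDF or TF-IDF-AP weights here.
score = cosine_similarity(Counter(query_tokens), Counter(doc_tokens))
```

Documents would then be ranked by this score for each query, once with TF-IDF weights and once with TF-IDF-AP weights.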

3. Results and Discussion

The test was conducted on a dataset of 50 journal articles taken from JELIKU, with 10 queries run. In this process, the accuracy of TF-IDF and TF-IDF-AP was very competitive, remaining at 95%. However, TF-IDF-AP offers an additional advantage: its weighting can be done dynamically, so it can adapt to the domain of a particular case.
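For reference, the four evaluation measures named in the research steps can be computed per query as in the sketch below. The function name and the document identifiers are hypothetical; only the collection size of 50 matches the dataset described above, and the retrieved and relevant sets are invented for illustration rather than taken from the study's results.

```python
def evaluate(retrieved, relevant, all_docs):
    """Precision, recall, accuracy, and F-measure for a single query.

    Assumes `retrieved` and `relevant` are subsets of `all_docs`.
    """
    retrieved, relevant, all_docs = set(retrieved), set(relevant), set(all_docs)
    tp = len(retrieved & relevant)             # relevant documents that were retrieved
    fp = len(retrieved - relevant)             # retrieved but not relevant
    fn = len(relevant - retrieved)             # relevant but missed
    tn = len(all_docs - retrieved - relevant)  # correctly left out

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(all_docs) if all_docs else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical query result over a 50-document collection.
all_docs = range(1, 51)
metrics = evaluate(retrieved={1, 2, 3, 4}, relevant={1, 2, 3, 5}, all_docs=all_docs)
```

Because most of the 50 documents are correctly left out for any single query, accuracy tends to be high for both methods, so precision, recall, and F-measure are usually more informative when comparing them.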

4. Conclusions

This research compared information search results obtained with the TF-IDF and TF-IDF-AP algorithms on documents from JELIKU (Journal of Computer Science), using several queries or keywords to retrieve documents relevant to each query. TF-IDF gives more weight to important words, but the position of words also influences the search. TF-IDF-AP produces values using the first position, the last position, and the adaptive position (a combination of the first and last positions) of a word. The two methods produced almost the same accuracy, but for certain queries TF-IDF-AP is superior because it can dynamically determine the position weight.


5. Suggestions

In this study the data used were not grouped. It would be better to use data that have already been grouped according to the topics discussed, so that the relevance of the search results to the documents can be higher.