English sentiment classification using a Fager & MacGowan coefficient and a genetic algorithm with a rank selection in a parallel network environment
Vo Ngoc Phu1, Vo Thi Ngoc Tran2
COMPUTER MODELLING & NEW TECHNOLOGIES 2018 22(1) 57-112
1Nguyen Tat Thanh University, 300A Nguyen Tat Thanh Street, Ward 13, District 4, Ho Chi Minh City, 702000, Vietnam
2School of Industrial Management (SIM), Ho Chi Minh City University of Technology - HCMUT, Vietnam National University, Ho Chi Minh City, Vietnam
We have studied data mining and natural language processing for many years, and there are many significant relationships between the two fields. Sentiment classification has made crucial contributions to many areas of everyday life, such as political activities, commodity production, and commercial activities. We propose a new model for sentiment classification that uses a Fager & MacGowan Coefficient (FMC) and a Genetic Algorithm (GA) whose fitness function (FF) is a Rank Selection (RS). Because the GA operates on bit arrays, the model needs little storage space, so it can be applied to big data. Firstly, we create the sentiment lexicons of our basis English sentiment dictionary (bESD) by using the FMC through a Google search engine with the AND and OR operators. Next, according to the sentiment lexicons of the bESD, we encode the 7,000,000 English sentences of our training data set, comprising 3,500,000 negative and 3,500,000 positive sentences, into bit arrays in a small storage space. We likewise encode all sentences of the 7,500,000 English documents of our testing data set, comprising 3,750,000 positive and 3,750,000 negative documents, into bit arrays in a small storage space. We use the GA with the RS to cluster each bit array (corresponding to one sentence) of a document of the testing data set into either the bit arrays of the negative sentences or the bit arrays of the positive sentences of the training data set. The sentiment classification of a document of the testing data set is based on the sentiment classification results of its sentences. We tested the proposed model in both a sequential environment and a distributed network system. We achieved 88.21% accuracy on the testing data set.
The execution time of the model in the parallel network environment is shorter than its execution time in the sequential system. The results of this work can be widely used in applications and research on English sentiment classification.
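As an illustration only, the following minimal Python sketch shows the general idea of rank selection over bit arrays. The abstract does not specify the exact selection rule or fitness measure used in the model, so the linear rank weights, the function names `rank_selection` and `bit_similarity`, and the toy data are all assumptions made for this example.

```python
import random


def bit_similarity(a, b):
    """Hypothetical fitness: number of positions where two equal-length bit arrays agree."""
    return sum(x == y for x, y in zip(a, b))


def rank_selection(population, fitness, k, rng=None):
    """Draw k individuals (with replacement), weighting each by its fitness rank.

    Assumption: linear ranking, where the worst individual has weight 1
    and the best has weight n. Other rank-to-weight mappings are possible.
    """
    rng = rng or random.Random()
    # Sort from worst to best so that list position + 1 is the rank.
    ranked = sorted(population, key=fitness)
    weights = list(range(1, len(ranked) + 1))
    return rng.choices(ranked, weights=weights, k=k)


# Toy data: one encoded test sentence and three encoded training sentences.
query = [1, 0, 1, 1]
population = [[1, 0, 1, 1], [0, 1, 0, 0], [1, 0, 0, 1]]

selected = rank_selection(
    population,
    fitness=lambda individual: bit_similarity(individual, query),
    k=2,
    rng=random.Random(0),
)
```

Because selection depends only on the rank order and not on the raw fitness values, a single very fit bit array cannot dominate the draw, which is the usual motivation for choosing rank selection over fitness-proportional selection.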