Description
The term neural network was traditionally used to refer to a network or circuit of biological neurons. The modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes.
Comparative Study of Artificial Neural Network based Classification for Liver Patient
Abstract The extensive accessibility of new computational methods and tools for data analysis and predictive modeling requires medical informatics researchers and practitioners to steadily select the most appropriate strategy to cope with clinical prediction problems. Data mining offers methodological and technical solutions to deal with the analysis of medical data and construction of prediction models. Patients with Liver disease have been continuously increasing because of excessive consumption of alcohol, inhale of harmful gases, intake of contaminated food, pickles and drugs. Therefore, in this study, Liver patient data is considered and evaluated by univariate analysis and a feature selection method for predicator attributes determination. Further comparative study of artificial neural network based predictive models such as BP, RBF, SOM, SVM are provided. Keywords: Medical Informatics, Classification, Liver Data, Artificial Neural Network 1. Introduction Data mining has been increasingly used in the medical literature over the last few years. In general, the term has not been anchored to any precise definition but to some sort of common understanding of its meaning: the use of methods and tools to analyze large amounts of data. Its application to the analysis of medical data - notwithstanding high hopes has until recently been relatively limited. This is particularly true of practical applications in clinical medicine which may benefit from specific data mining approaches that are able to perform predictive modelling, exploit the knowledge available in the clinical domain and explain proposed decisions once the models are used to support clinical decisions. The goal of predictive data mining in clinical medicine is to derive models that can use patient specific information to predict the outcome of interest and to thereby support clinical decision-making. Predictive data mining methods may be applied to the construction of decision models for procedures such as prognosis, diagnosis and treatment planning, which - once evaluated and verified - may be embedded within clinical information systems (Bellazzi & Zupan 2008). The aim of this study is to provide a comparative framework among different neural network based predicative data mining techniques for diagnosing Liver disorder. The work is organized as follows. In the section 2, related works are reported. In the section 3 materials and methods are presented. The results of the experiments are presented in section 4. Final, this paper is concluded in section 5. 2. Related Works A literature survey showed that there have been several studies on the survivability prediction problem using statistical approaches and artificial neural networks. However, a few studies related to medical diagnosis and survivability using data mining approaches have been reported. Bellaachia and Guven (2006) presented an analysis of the prediction of survivability rate of breast cancer patients' using data mining techniques. The data used is the SEER Public-Use Data. The investigated three data mining techniques: the Naïve Bayes, the back-propagated neural network, and the C4.5 decision tree algorithms. Experiments were conducted using these algorithms. The achieved prediction performances are comparable to
1
existing techniques. However, we found out that C4.5 algorithm has a much better performance than the other two techniques. Sug (2012) applied decision trees, C4.5 and CART to Liver Disorder Disease. Ramana et al. (2011) applied the classification algorithms considered here are Naïve Bayes, classifier, C4.5, Back propagation Neural Network algorithm, and Support Vector Machines for liver patients. In this study, they considered and men and women in single data unit. Also the children are considered in same data sample. The adult men, adult women and children data may be considered separately due to different physiological structures. Also, significant differences between means among women and men groups were measured. Therefore, in this work, the men and women are considered in different data set and more refined approaches RBF, SOM algorithms are applied. 3. Materials and Methods 3.1 Data Set In this study, Indian Liver Patient Dataset (ILPD) (Frank & Asumcion 2010) was used. The data set was collected from north east of Andhra Pradesh, India (Ramana et al. 2012; Ramana et al. 2011). This data set comprises Age of the patient, Total Bilirubin, Direct Bilirubin Alkaline Phosphotase, Alamine Aminotransferase, Aspartate Aminotransferase, Total Protiens, Albumin, Albumin and Globulin Ratio, class labeled by the experts. Selector is a class label used to divide into groups (liver patient or not). In this work, the noisy data were ignored and only the age greater than 17 were considered. Men data set contains 316 liver patient and 105 non liver patient after ignoring the noisy and age less than 18. Similarly, women data set contains 86 liver patient and 47 non liver patient records. The detail descriptive statistics for Men and Women data set are given in the Table 1 and 2. Table 1. Descriptive Statistics for Men N= 421 Min Age Total Bilirubin Direct Bilirubin Total Protein Albumin A/G Ratio SGPT SGOT Alkphos 18 0.4 0.1 75 10 11 2.7 0.9 0.3 Max 90 75 19.7 2110 2000 4929 9.6 5.5 2.8 Mean 46.7?15.1 3.7?6.7 1.7?3.0 284.1?224.3 90.7?206.7 125.9?330.4 6.4?1.1 3.1?0.8 0.94?0.3 Ske w 0.08 4.84 3.01 3.94 5.98 9.57 -.279 0.02 1.12 Kurtosis -0.82 35.89 10.15 21.05 40.78 120.18 0.526 -0.372 3.83
Table 2. Descriptive Statistics for Women N= 133 M in Max Mean Ske w Age Total Bilirubin Direct Bilirubin Total Protein Albumin A/G Ratio SGPT SGOT Alkphos 18 0.5 0.1 63 10 10 3.6 1.0 0.3 85 27.7 12.8 1896 509 623 9.2 5.5 1.8 44.8?14.7 2.2?4.5 0.96?2.2 296.7?293.5 50.7?72.6 63.7?100.1 6.63?1.1 3.25?0.8 0.94?0.3 0.37 4.1 3.98 3.46 4.09 3.85 -0.313 -0.14 0.69
Kurtosis -0.22 17.03 15.88 12.95 19.40 16.31 -0.39 -0.14 1.22
2
3.2 Methods Artificial Neural Network (ANN) is widely used algorithms in many applications and modeled based on biological neural systems. For classification, the model is trained in such a way that input and output nodes are connected with a set of weighted links based on input-output association of training data. More complex ANNs can have one or more hidden layers with hidden nodes in between input and output nodes to model more complex input-output relationships. Depending upon the application, either a feed-forward (nodes in one layer is connected only to nodes in next layer) or recurrent (nodes can be connected to nodes in the same layer, previous layer or next layer) network is built. The relationship between input and output is defined by an activation function which along with the weighted links decides the behaviour of the ANN. The activation function can be linear, sigmoid (logistic) and hyperbolic tangent function (Kotsiantis 2007). In this paper, back propagations (BP) learning, radial basis function networks (RBF), self organizing map (SOM), and support vector machine (SVM) methods are used to classify the Liver patient data. Back Propagation is a multi layer preceptron neural network. The term back propagation means the backward propagation of an error signal through the network. After propagating a pattern through the network - feed forward, the output pattern is compared with a given target and the error of each output unit is calculated. This error is propagated backwards to the input layer - back propagation. Finally the errors of the units are used to modify the weights. The back propagation algorithm is based on Widrow-Hoff delta learning rule in which the weight adjustment is done through mean square error of the output response to the sample input (Sivanandam et al. 2006). High-dimensional input patterns with prototype lattice structure are represented by SOM and can be visualized in two-dimensional lattice structure. Each unit in the lattice (neuron) and adjacent neurons are interconnected, which gives the clear topology of how the network fits itself to the input space. Input patterns are fully connected to all neurons via adaptable weights, and during the training process, neighboring input patterns are projected into the lattice, corresponding to adjacent neurons (Sivanandam et al. 2006). SOM is also used for supervised learning (Salah et al. 2009). RBF is a special type of neural networks with several distinctive features. A RBF network consists of three layers, namely the input layer, the hidden layer, and the output layer. The input layer broadcasts the coordinates of the input vector to each of the units in the hidden layer. Each unit in the hidden layer then produces an activation based on the associated radial basis function. Finally, each unit in the output layer computes a linear combination of the activations of the hidden units. How a RBF network reacts to a given input stimulus is completely determined by the activation functions associated with the hidden units and the weights associated with the links between the hidden layer and the output layer (Oyang et al. 2005). There are several types of radial basis function which can be used for classification tasks such as Gaussian radial function, Normalized Gaussian radial function, Thin plate spline, Quadratic, Inverse quadratic (Fornberg & Piret 2008). In this work, Normalized Gaussian radial function was used. SVMs, like other nonparametric classifiers such as Artificial Neural Networks, boast a robustness that has spearheaded its application into many areas. Like other supervised classifiers, training data is a prerequisite to define the decision boundaries within the feature space, based upon which classification decision rules are made. For SVMs, this decision boundary is a linear discriminate placed midway between the classes of interest. SVMs handle nonlinear datasets by use of a kernel function. Some examples of functions (also called kernels) used to this effect include: polynomial, Gaussian (more commonly referred to as radial basis functions) and sigmoid functions. Each function has parameters which have to be determined prior to classification and they are usually determined through a cross validation process. Operating in high dimension potentially renders the risk of over-fitting in the input space possible. SVMs control this through the principle of Structural Risk Minimization. The empirical risk of misclassification is controlled by maximizing the margin between the training data and the decision boundary (Anthony et al. 2008; Vapnik 1995). 4. Experimental Investigation and Results Initially, the predicator attributes for the classification task was determined by computing significant differences between classes; in this regard, the univariate analysis of variance was applied. Also, correlation-based feature subset selection method was used to evaluate the predicator attributes (Hall et al. 2009; Hall 1999). Total Bilirubin, Direct Bilirubin, Total Protein, Albumin, A/G Ratio, SGOT, and Alkphos were found significant difference in men data set
3
by univariate analysis. Total Bilrubin, Direct Bilirubin, Total Protein and A/G Ratio were found significant difference in women data set by univariate analysis. The correlation-based feature subset selection method selected Total Bilirubin, Direct Bilirubin, Total Protein, Albumin, A/G Ratio in case of men data and Total Bilirubin, Direct Bilirubin, Total Protein, A/G Ratio in case of women data. In this work, common attributes between univariate analysis and feature subset selection method were considered for classification task in both cases of men and women data set. The artificial networks classifiers i.e. BP, SOM, RBF, and SVM were trained and tested. The experiments were conducted in Weka 3.7 (Hall et al. 2009). The summary of results in case of men and women are shown on Table 3 and 4, respectively. Our experimental investigation yields a significant output in terms of the correctly classified success rate. The SVM provided high correctly classified success rate 99.76% and 97.7% in case of men and women data. Table 3. Summary of classification result for Men N= 421 Parameters Accuracy % Mean absolute error Root mean squared error Relative absolute error % Root relative squared error % Coverage of cases (0.95 level) % BP 74.8 0.33 0.40 88.0 92.29 99.29 SO M 75.05 0.24 0.49 66.50 115.42 75.05 RBF 75.8 0.30 0.39 81.48 90.30 99.52 SVM 99.76 0.002 0.05 0.63 11.26 99.76
Table 4. Summary of classification result for Women N= 113 Parameters BP SO M Accuracy % 64.7 64.7 Mean absolute error 0.40 0.35 Root mean squared error 0.44 0.59 0.44 Relative absolute error % 88.4 77.21 Root relative squared error % 93.4 124.4 Coverage of cases (0.95 level) % 100 64.7 5. Conclusion
RBF 66.2 0.38 0.15 83.1 91.2 100
SVM 97.7 0.02 4.9 31.4 97.7
Predictive data mining is becoming an important instrument for researchers and clinical practitioners in medicine. Understanding the main issues underlying these methods and the application of agreed and standardized procedures is mandatory for their deployment and the dissemination of results. In this study, the Liver data was evaluated by univariate analysis and feature selection methods and predicator attributes are determined. Further, ANN based classifiers were applied on selected attributes. It shows the ANN classifiers may be used as patient predicator tool.
doc_257327133.docx
The term neural network was traditionally used to refer to a network or circuit of biological neurons. The modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes.
Comparative Study of Artificial Neural Network based Classification for Liver Patient
Abstract The extensive accessibility of new computational methods and tools for data analysis and predictive modeling requires medical informatics researchers and practitioners to steadily select the most appropriate strategy to cope with clinical prediction problems. Data mining offers methodological and technical solutions to deal with the analysis of medical data and construction of prediction models. Patients with Liver disease have been continuously increasing because of excessive consumption of alcohol, inhale of harmful gases, intake of contaminated food, pickles and drugs. Therefore, in this study, Liver patient data is considered and evaluated by univariate analysis and a feature selection method for predicator attributes determination. Further comparative study of artificial neural network based predictive models such as BP, RBF, SOM, SVM are provided. Keywords: Medical Informatics, Classification, Liver Data, Artificial Neural Network 1. Introduction Data mining has been increasingly used in the medical literature over the last few years. In general, the term has not been anchored to any precise definition but to some sort of common understanding of its meaning: the use of methods and tools to analyze large amounts of data. Its application to the analysis of medical data - notwithstanding high hopes has until recently been relatively limited. This is particularly true of practical applications in clinical medicine which may benefit from specific data mining approaches that are able to perform predictive modelling, exploit the knowledge available in the clinical domain and explain proposed decisions once the models are used to support clinical decisions. The goal of predictive data mining in clinical medicine is to derive models that can use patient specific information to predict the outcome of interest and to thereby support clinical decision-making. Predictive data mining methods may be applied to the construction of decision models for procedures such as prognosis, diagnosis and treatment planning, which - once evaluated and verified - may be embedded within clinical information systems (Bellazzi & Zupan 2008). The aim of this study is to provide a comparative framework among different neural network based predicative data mining techniques for diagnosing Liver disorder. The work is organized as follows. In the section 2, related works are reported. In the section 3 materials and methods are presented. The results of the experiments are presented in section 4. Final, this paper is concluded in section 5. 2. Related Works A literature survey showed that there have been several studies on the survivability prediction problem using statistical approaches and artificial neural networks. However, a few studies related to medical diagnosis and survivability using data mining approaches have been reported. Bellaachia and Guven (2006) presented an analysis of the prediction of survivability rate of breast cancer patients' using data mining techniques. The data used is the SEER Public-Use Data. The investigated three data mining techniques: the Naïve Bayes, the back-propagated neural network, and the C4.5 decision tree algorithms. Experiments were conducted using these algorithms. The achieved prediction performances are comparable to
1
existing techniques. However, we found out that C4.5 algorithm has a much better performance than the other two techniques. Sug (2012) applied decision trees, C4.5 and CART to Liver Disorder Disease. Ramana et al. (2011) applied the classification algorithms considered here are Naïve Bayes, classifier, C4.5, Back propagation Neural Network algorithm, and Support Vector Machines for liver patients. In this study, they considered and men and women in single data unit. Also the children are considered in same data sample. The adult men, adult women and children data may be considered separately due to different physiological structures. Also, significant differences between means among women and men groups were measured. Therefore, in this work, the men and women are considered in different data set and more refined approaches RBF, SOM algorithms are applied. 3. Materials and Methods 3.1 Data Set In this study, Indian Liver Patient Dataset (ILPD) (Frank & Asumcion 2010) was used. The data set was collected from north east of Andhra Pradesh, India (Ramana et al. 2012; Ramana et al. 2011). This data set comprises Age of the patient, Total Bilirubin, Direct Bilirubin Alkaline Phosphotase, Alamine Aminotransferase, Aspartate Aminotransferase, Total Protiens, Albumin, Albumin and Globulin Ratio, class labeled by the experts. Selector is a class label used to divide into groups (liver patient or not). In this work, the noisy data were ignored and only the age greater than 17 were considered. Men data set contains 316 liver patient and 105 non liver patient after ignoring the noisy and age less than 18. Similarly, women data set contains 86 liver patient and 47 non liver patient records. The detail descriptive statistics for Men and Women data set are given in the Table 1 and 2. Table 1. Descriptive Statistics for Men N= 421 Min Age Total Bilirubin Direct Bilirubin Total Protein Albumin A/G Ratio SGPT SGOT Alkphos 18 0.4 0.1 75 10 11 2.7 0.9 0.3 Max 90 75 19.7 2110 2000 4929 9.6 5.5 2.8 Mean 46.7?15.1 3.7?6.7 1.7?3.0 284.1?224.3 90.7?206.7 125.9?330.4 6.4?1.1 3.1?0.8 0.94?0.3 Ske w 0.08 4.84 3.01 3.94 5.98 9.57 -.279 0.02 1.12 Kurtosis -0.82 35.89 10.15 21.05 40.78 120.18 0.526 -0.372 3.83
Table 2. Descriptive Statistics for Women N= 133 M in Max Mean Ske w Age Total Bilirubin Direct Bilirubin Total Protein Albumin A/G Ratio SGPT SGOT Alkphos 18 0.5 0.1 63 10 10 3.6 1.0 0.3 85 27.7 12.8 1896 509 623 9.2 5.5 1.8 44.8?14.7 2.2?4.5 0.96?2.2 296.7?293.5 50.7?72.6 63.7?100.1 6.63?1.1 3.25?0.8 0.94?0.3 0.37 4.1 3.98 3.46 4.09 3.85 -0.313 -0.14 0.69
Kurtosis -0.22 17.03 15.88 12.95 19.40 16.31 -0.39 -0.14 1.22
2
3.2 Methods Artificial Neural Network (ANN) is widely used algorithms in many applications and modeled based on biological neural systems. For classification, the model is trained in such a way that input and output nodes are connected with a set of weighted links based on input-output association of training data. More complex ANNs can have one or more hidden layers with hidden nodes in between input and output nodes to model more complex input-output relationships. Depending upon the application, either a feed-forward (nodes in one layer is connected only to nodes in next layer) or recurrent (nodes can be connected to nodes in the same layer, previous layer or next layer) network is built. The relationship between input and output is defined by an activation function which along with the weighted links decides the behaviour of the ANN. The activation function can be linear, sigmoid (logistic) and hyperbolic tangent function (Kotsiantis 2007). In this paper, back propagations (BP) learning, radial basis function networks (RBF), self organizing map (SOM), and support vector machine (SVM) methods are used to classify the Liver patient data. Back Propagation is a multi layer preceptron neural network. The term back propagation means the backward propagation of an error signal through the network. After propagating a pattern through the network - feed forward, the output pattern is compared with a given target and the error of each output unit is calculated. This error is propagated backwards to the input layer - back propagation. Finally the errors of the units are used to modify the weights. The back propagation algorithm is based on Widrow-Hoff delta learning rule in which the weight adjustment is done through mean square error of the output response to the sample input (Sivanandam et al. 2006). High-dimensional input patterns with prototype lattice structure are represented by SOM and can be visualized in two-dimensional lattice structure. Each unit in the lattice (neuron) and adjacent neurons are interconnected, which gives the clear topology of how the network fits itself to the input space. Input patterns are fully connected to all neurons via adaptable weights, and during the training process, neighboring input patterns are projected into the lattice, corresponding to adjacent neurons (Sivanandam et al. 2006). SOM is also used for supervised learning (Salah et al. 2009). RBF is a special type of neural networks with several distinctive features. A RBF network consists of three layers, namely the input layer, the hidden layer, and the output layer. The input layer broadcasts the coordinates of the input vector to each of the units in the hidden layer. Each unit in the hidden layer then produces an activation based on the associated radial basis function. Finally, each unit in the output layer computes a linear combination of the activations of the hidden units. How a RBF network reacts to a given input stimulus is completely determined by the activation functions associated with the hidden units and the weights associated with the links between the hidden layer and the output layer (Oyang et al. 2005). There are several types of radial basis function which can be used for classification tasks such as Gaussian radial function, Normalized Gaussian radial function, Thin plate spline, Quadratic, Inverse quadratic (Fornberg & Piret 2008). In this work, Normalized Gaussian radial function was used. SVMs, like other nonparametric classifiers such as Artificial Neural Networks, boast a robustness that has spearheaded its application into many areas. Like other supervised classifiers, training data is a prerequisite to define the decision boundaries within the feature space, based upon which classification decision rules are made. For SVMs, this decision boundary is a linear discriminate placed midway between the classes of interest. SVMs handle nonlinear datasets by use of a kernel function. Some examples of functions (also called kernels) used to this effect include: polynomial, Gaussian (more commonly referred to as radial basis functions) and sigmoid functions. Each function has parameters which have to be determined prior to classification and they are usually determined through a cross validation process. Operating in high dimension potentially renders the risk of over-fitting in the input space possible. SVMs control this through the principle of Structural Risk Minimization. The empirical risk of misclassification is controlled by maximizing the margin between the training data and the decision boundary (Anthony et al. 2008; Vapnik 1995). 4. Experimental Investigation and Results Initially, the predicator attributes for the classification task was determined by computing significant differences between classes; in this regard, the univariate analysis of variance was applied. Also, correlation-based feature subset selection method was used to evaluate the predicator attributes (Hall et al. 2009; Hall 1999). Total Bilirubin, Direct Bilirubin, Total Protein, Albumin, A/G Ratio, SGOT, and Alkphos were found significant difference in men data set
3
by univariate analysis. Total Bilrubin, Direct Bilirubin, Total Protein and A/G Ratio were found significant difference in women data set by univariate analysis. The correlation-based feature subset selection method selected Total Bilirubin, Direct Bilirubin, Total Protein, Albumin, A/G Ratio in case of men data and Total Bilirubin, Direct Bilirubin, Total Protein, A/G Ratio in case of women data. In this work, common attributes between univariate analysis and feature subset selection method were considered for classification task in both cases of men and women data set. The artificial networks classifiers i.e. BP, SOM, RBF, and SVM were trained and tested. The experiments were conducted in Weka 3.7 (Hall et al. 2009). The summary of results in case of men and women are shown on Table 3 and 4, respectively. Our experimental investigation yields a significant output in terms of the correctly classified success rate. The SVM provided high correctly classified success rate 99.76% and 97.7% in case of men and women data. Table 3. Summary of classification result for Men N= 421 Parameters Accuracy % Mean absolute error Root mean squared error Relative absolute error % Root relative squared error % Coverage of cases (0.95 level) % BP 74.8 0.33 0.40 88.0 92.29 99.29 SO M 75.05 0.24 0.49 66.50 115.42 75.05 RBF 75.8 0.30 0.39 81.48 90.30 99.52 SVM 99.76 0.002 0.05 0.63 11.26 99.76
Table 4. Summary of classification result for Women N= 113 Parameters BP SO M Accuracy % 64.7 64.7 Mean absolute error 0.40 0.35 Root mean squared error 0.44 0.59 0.44 Relative absolute error % 88.4 77.21 Root relative squared error % 93.4 124.4 Coverage of cases (0.95 level) % 100 64.7 5. Conclusion
RBF 66.2 0.38 0.15 83.1 91.2 100
SVM 97.7 0.02 4.9 31.4 97.7
Predictive data mining is becoming an important instrument for researchers and clinical practitioners in medicine. Understanding the main issues underlying these methods and the application of agreed and standardized procedures is mandatory for their deployment and the dissemination of results. In this study, the Liver data was evaluated by univariate analysis and feature selection methods and predicator attributes are determined. Further, ANN based classifiers were applied on selected attributes. It shows the ANN classifiers may be used as patient predicator tool.
doc_257327133.docx