Margin-based Semi-supervised Learning Using the Apollonius Circle
Mona Emadi and Jafar Tanha
TTCS 2020
Semi-supervised learning
Supervised learning: all training data are labeled.
Semi-supervised learning: some labeled data plus lots of unlabeled data.
Unsupervised learning: all training data are unlabeled.
Self-training semi-supervised algorithm
Step 1: Initialize the labeled training set L = L_init.
Step 2: f = learn classifier(L).
Step 3: Apply f to the unlabeled pool U; let L_self be the k examples with the most confident predictions.
Step 4: Augment the training set: L = L ∪ L_self, and remove these examples from the unlabeled pool.
Step 5: Repeat from Step 2.
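A minimal sketch of this generic self-training loop, assuming scikit-learn's SVC as the base classifier; the values of k and max_iter are illustrative, not the ones used in the paper:

```python
import numpy as np
from sklearn.svm import SVC

def self_training(X_labeled, y_labeled, X_unlabeled, k=10, max_iter=20):
    """Generic self-training: train, pseudo-label the k most confident
    unlabeled examples, augment the labeled set, and repeat."""
    L_X, L_y, U = X_labeled.copy(), y_labeled.copy(), X_unlabeled.copy()
    clf = SVC(probability=True).fit(L_X, L_y)            # Step 2
    for _ in range(max_iter):
        if len(U) == 0:
            break
        conf = clf.predict_proba(U).max(axis=1)          # Step 3: confidence on U
        top = np.argsort(conf)[-k:]                      # k most confident examples
        L_X = np.vstack([L_X, U[top]])                   # Step 4: augment L
        L_y = np.concatenate([L_y, clf.predict(U[top])])
        U = np.delete(U, top, axis=0)                    # remove from the pool
        clf = SVC(probability=True).fit(L_X, L_y)        # Step 5: retrain
    return clf
```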
The main contributions
➢ Evaluating the use of data points that are close to the decision boundary for improving classification performance.
➢ Proposing a geometry-based selection metric to find informative unlabeled data points.
➢ Defining a new metric that measures the similarity between labeled and unlabeled data points based on the proposed geometrical structure.
➢ Addressing an agreement-based approach for selecting among the newly labeled data, based on classifier predictions and the proposed neighborhood construction algorithm.
Apollonius circle
The Apollonius circle is the locus of points M in the Euclidean plane whose ratio of distances to two fixed points A and B is a fixed constant K:
K = d(A, M) / d(M, B)
Apollonius circle: dependence on the ratio K
The circle lies on A's side and encloses A when K < 1, lies on B's side and encloses B when K > 1, and degenerates into the perpendicular bisector of AB (a line, i.e., a circle of infinite radius) when K = 1.
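For K ≠ 1 the locus is an ordinary circle with a closed-form center and radius; a minimal sketch derived from the definition above (my own derivation, not code from the paper):

```python
import numpy as np

def apollonius_circle(A, B, K):
    """Center and radius of {M : d(A, M) / d(M, B) = K}, K != 1.
    For K = 1 the locus degenerates to the perpendicular bisector of AB."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    if np.isclose(K, 1.0):
        raise ValueError("K = 1: the locus is a line, not a circle")
    center = A + (K**2 / (K**2 - 1.0)) * (B - A)     # lies on the line AB
    radius = K * np.linalg.norm(B - A) / abs(K**2 - 1.0)
    return center, radius

# Sanity check: any point M on the circle satisfies d(A, M) / d(M, B) = K.
c, r = apollonius_circle([0, 0], [4, 0], K=0.5)      # K < 1: circle encloses A
M = c + r * np.array([np.cos(1.0), np.sin(1.0)])     # arbitrary point on the circle
print(np.linalg.norm(M - [0, 0]) / np.linalg.norm(M - [4, 0]))  # ~ 0.5
```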
Density peaks
➢ The local density ρ_i of a point M_i is defined as:
ρ_i = exp(−(Σ_{M_j ∈ N(M_i)} d(M_i, M_j)²) / r), with r = p × n and d(M_i, M_j) = ‖M_i − M_j‖.
➢ δ_i is the minimum distance between M_i and any other sample with higher density than ρ_i, defined as:
δ_i = min_{j: ρ_i < ρ_j} d(M_i, M_j) if some j with ρ_i < ρ_j exists, and max_j d(M_i, M_j) otherwise.
➢ Peaks (high-density points) are obtained using the score function score(M_i) = ρ_i × δ_i.
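A compact sketch of this scoring. Where the slide is ambiguous I assume the neighborhood N(M_i) is the k nearest neighbours, with k derived from r = p × n; p = 0.02 is only a placeholder value:

```python
import numpy as np

def density_peak_scores(X, p=0.02):
    """rho_i: local density over each point's neighbourhood; delta_i:
    distance to the nearest denser point (max distance for the global
    peak); score_i = rho_i * delta_i."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise d(M_i, M_j)
    r = p * n                                                  # r = p * n as on the slide
    k = max(1, int(round(r)))                                  # neighbourhood size (assumed)
    knn_d2 = np.sort(D, axis=1)[:, 1:k + 1] ** 2               # squared distances to kNN
    rho = np.exp(-knn_d2.sum(axis=1) / r)
    delta = np.empty(n)
    for i in range(n):
        denser = np.where(rho > rho[i])[0]                     # samples denser than M_i
        delta[i] = D[i, denser].min() if len(denser) else D[i].max()
    return rho, delta, rho * delta                             # peaks = highest scores
```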
Neighborhood groups with the Apollonius circle
Peak points: P = (P_1, P_2, …, P_m); the remaining data points are M = {M_i | i ∈ {1, 2, …, n − m}}, M_i ∉ P.
The farthest distance of a peak P_t, t ∈ {1, 2, …, m − 1}, is defined as:
Fd(P_t) = max{ d(P_t, M_i) | M_i ∈ M and d(P_t, M_i) < d(P_t, P_{t+1}) and d(P_t, M_i) < min_{l ≠ t} d(P_l, M_i) }
Farthest data points: FP(P_t) = { M_i | d(P_t, M_i) = Fd(P_t) }
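A direct transcription of this rule as a sketch; the function name and the choice to return indices into M are mine, and ties for the maximum are kept as a set, mirroring FP(P_t):

```python
import numpy as np

def farthest_points(P, M):
    """Fd(P_t): the largest d(P_t, M_i) over points M_i that are closer
    to P_t than to any other peak and closer than the next peak P_{t+1};
    FP(P_t): the indices of points achieving that maximum."""
    P, M = np.asarray(P, float), np.asarray(M, float)
    D = np.linalg.norm(P[:, None, :] - M[None, :, :], axis=2)  # d(P_t, M_i)
    out = {}
    for t in range(len(P) - 1):                    # slide: t in {1, ..., m-1}
        gap = np.linalg.norm(P[t] - P[t + 1])      # d(P_t, P_{t+1})
        closest = D.argmin(axis=0) == t            # closer to P_t than to other peaks
        ok = closest & (D[t] < gap)
        if ok.any():
            fd = D[t, ok].max()
            out[t] = (fd, np.where(ok & np.isclose(D[t], fd))[0])
        else:
            out[t] = (None, np.array([], dtype=int))
    return out
```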
Example dataset for finding farthest points and grouping
With peaks P = {2, 5, 8} and remaining points M = {1, 3, 4, 6, 7, 9, 10}:
Fd(2) = max{ d(2, M_i) | M_i ∈ {1, 3, 4, 6, 7, 9, 10} and d(2, M_i) < d(5, M_i) and d(2, M_i) < d(8, M_i) and d(2, M_i) < d(2, 8) } = d(2, 3), so FP(2) = {3}.
Example of making neighborhood groups with the Apollonius circle (figure: two labeled classes, unlabeled data, two peaks, and the farthest point of each peak).
Impact of selecting data close to the decision boundary (figure).
Proposed semi-supervised algorithm (figure: overall procedure).
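The slide presents the procedure only as a figure, so the following is a hedged outline assembled from the conclusion's description (candidates near the SVM boundary, density peaks, an Apollonius circle per peak, agreement-based labeling), for the binary case. It reuses density_peak_scores and apollonius_circle from the sketches above; K, band, n_peaks, and p are assumed hyperparameters, and picking the farthest candidate per peak is a simplification of the FP rule:

```python
import numpy as np
from sklearn.svm import SVC

def apollonius_self_training(L_X, L_y, U, K=0.5, band=0.5, n_peaks=4, p=0.02):
    """Outline of one reading of the proposed loop (not the authors' code)."""
    U = np.asarray(U, float).copy()
    clf = SVC(kernel="linear").fit(L_X, L_y)
    while len(U) > 0:
        # 1) candidates: unlabeled points near the current decision boundary
        near_mask = np.abs(clf.decision_function(U)) < band
        if not near_mask.any():
            break
        cand = U[near_mask]
        # 2) density peaks among the labeled data (highest rho * delta)
        _, _, score = density_peak_scores(L_X, p)
        peaks = np.argsort(score)[-n_peaks:]
        new_X, new_y = [], []
        for t in peaks:
            # 3) Apollonius circle between the peak and its farthest candidate
            far = cand[np.argmax(np.linalg.norm(cand - L_X[t], axis=1))]
            c, r = apollonius_circle(L_X[t], far, K)     # K < 1: encloses the peak
            inside = cand[np.linalg.norm(cand - c, axis=1) <= r]
            if len(inside) == 0:
                continue
            # 4) agreement check: classifier prediction must match the peak label
            keep = inside[clf.predict(inside) == L_y[t]]
            if len(keep):
                new_X.append(keep)
                new_y.append(np.full(len(keep), L_y[t]))
        if not new_X:
            break
        # 5) augment L with the newly labeled points, shrink U, retrain
        L_X = np.vstack([L_X] + new_X)
        L_y = np.concatenate([L_y] + new_y)
        U = U[~near_mask]
        clf = SVC(kernel="linear").fit(L_X, L_y)
    return clf
```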
Properties of the datasets used

Name        #Examples   #Attributes (D)   #Classes
Iris           150            4              3
Wine           178           13              3
Seeds          210            7              3
Thyroid        215            5              3
Glass          214            9              6
Banknote      1372            4              2
Liver          345            6              2
Blood          748            4              2
Experimental results: accuracy (%) of the algorithms with 10% labeled data

Dataset     Supervised SVM   Self-training SVM   STC-DPC   Our algorithm
Iris             92.50             87              91          95.76
Wine             88.30             90.81           86.96       91.40
Seeds            84.16             74.40           81.19       92.35
Thyroid          88.95             87.21           89.65       91.72
Glass            47.44             51.15           51.15       51.93
Banknote         98.39             98.77           98.12       96.62
Liver            58.04             57.31           55.29       61.90
Blood            72.42             72.58           72.01       74.98
Accuracy (%) of our algorithm using all unlabeled data versus only unlabeled data near the decision boundary

Dataset        All unlabeled data   Selected unlabeled data
Banknote             96.58                 96.62
Liver                59.85                 61.90
Blood                75.45                 74.98
Heart                74.78                 78.25
Hypothyroid          78.78                 78.25
Diabetes             62.72                 63.47
Parkinson            80.62                 80.62
Impact of the ratio of labeled data (figures: performance as the fraction of labeled data varies).
Discussion: Banknote dataset (figure).
Discussion: Wine, Iris, and Seeds datasets (figures).
Conclusion
➢ We proposed a semi-supervised self-training method based on the Apollonius circle.
➢ First, candidate data points are selected from the unlabeled data to be labeled during the self-training process. Then, the peak points are found using density peak clustering. The Apollonius circle corresponding to each peak point is formed, and the label of the peak point is assigned to the unlabeled data points inside that circle. The base classifier is SVM, which is a margin-based algorithm.
Conclusion
➢ A series of experiments was performed on several datasets to evaluate the performance of the proposed algorithm. According to the experimental results, the proposed algorithm outperforms the other algorithms, especially on datasets without a scattered data distribution or mixed structure.
➢ The impact of selecting data close to the decision boundary was investigated: data points close to the decision boundary affect the optimal adjustment of the decision boundary more than the farthest ones, and they also improve classification performance.
References
[1] Wu, D., Shang, M., Luo, X., Xu, J., Yan, H., Deng, W., Wang, G., 2018. Self-training semi-supervised classification based on density peaks of data. Neurocomputing 275, 180-191.
[2] Pourbahrami, S., Khanli, L.M., Azimpour, S., 2019. A novel and efficient data point neighborhood construction algorithm based on Apollonius circle. Expert Systems with Applications 115, 57-67.
[3] Rodriguez, A., Laio, A., 2014. Clustering by fast search and find of density peaks. Science 344(6191), 1492-1496.
[4] Tanha, J., 2019. A multiclass boosting algorithm to labeled and unlabeled data. International Journal of Machine Learning and Cybernetics.
[5] Tanha, J., van Someren, M., Afsarmanesh, H., 2014. Boosting for multiclass semi-supervised learning. Pattern Recognition Letters 37, 63-77.
[6] Tanha, J., van Someren, M., Afsarmanesh, H., 2017. Semi-supervised self-training for decision tree classifiers. International Journal of Machine Learning and Cybernetics 8, 355-370.
[7] Zhou, Y., Kantarcioglu, M., Thuraisingham, B., 2012. Self-training with selection-by-rejection. In: Proceedings of the 2012 IEEE 12th International Conference on Data Mining (ICDM '12).
Thank you for your attention. emadi.mona@pnu.ac.ir