GS4: Generating synthetic samples for semi-supervised nearest neighbor classification

Panagiotis Moutafis, Ioannis A. Kakadiaris

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Scopus citations

Abstract

In this paper, we propose a method to improve nearest neighbor classification accuracy under a semi-supervised setting. We call our approach GS4 (i.e., Generating Synthetic Samples Semi-Supervised). Existing self-training approaches classify unlabeled samples by exploiting local information. These samples are then incorporated into the training set of labeled data. However, errors are propagated and misclassifications at an early stage severely degrade the classification accuracy. To address this problem, the proposed method exploits the unlabeled data by using weights proportional to the classification confidence to generate synthetic samples. Specifically, our scheme is inspired by the Synthetic Minority Over-Sampling Technique. That is, each unlabeled sample is used to generate as many labeled samples as the number of classes represented by its k-nearest neighbors. In particular, the distance of each synthetic sample from its k-nearest neighbors of the same class is proportional to the classification confidence. As a result, the robustness to misclassification errors is increased and better accuracy is achieved. Experimental results using publicly available datasets demonstrate that statistically significant improvements are obtained when the proposed approach is employed.
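The core idea described above can be sketched roughly as follows. This is an illustrative reconstruction based only on the abstract, not the authors' implementation: for each unlabeled point, one synthetic labeled sample is generated per class found among its k nearest labeled neighbors, and the synthetic sample's distance from the same-class neighbors is made proportional to the classification confidence (the class's share of the k neighbors). The function name `gs4_synthesize` and the use of the same-class centroid as the interpolation anchor are assumptions.

```python
import numpy as np

def gs4_synthesize(X_lab, y_lab, x_unlab, k=5):
    """Sketch of GS4-style synthetic sample generation (assumed logic,
    not the published algorithm): for each class represented among the
    k nearest labeled neighbors of an unlabeled point, emit one
    synthetic labeled sample whose distance from the same-class
    neighbors grows with the classification confidence."""
    # Distances from the unlabeled point to every labeled sample
    dists = np.linalg.norm(X_lab - x_unlab, axis=1)
    nn = np.argsort(dists)[:k]  # indices of the k nearest labeled samples

    synthetic, labels = [], []
    for cls in np.unique(y_lab[nn]):
        same = nn[y_lab[nn] == cls]
        conf = len(same) / k                 # classification confidence for this class
        centroid = X_lab[same].mean(axis=0)  # center of the same-class neighbors (assumed anchor)
        # Higher confidence -> synthetic sample placed farther from the
        # same-class neighbors, i.e. closer to the unlabeled point itself.
        synthetic.append(conf * x_unlab + (1.0 - conf) * centroid)
        labels.append(cls)
    return np.array(synthetic), np.array(labels)
```

With `k = 5` and a neighborhood containing two classes, the call returns two synthetic samples, one per class; a unanimous neighborhood (confidence 1.0) places the single synthetic sample at the unlabeled point itself.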

Original language: English (US)
Title of host publication: Trends and Applications in Knowledge Discovery and Data Mining - PAKDD 2014 International Workshops
Subtitle of host publication: DANTH, BDM, MobiSocial, BigEC, CloudSD, MSMV-MBI, SDA, DMDA-Health, ALSIP, SocNet, DMBIH, BigPMA, Revised Selected Papers
Editors: Wen-Chih Peng, Haixun Wang, Zhi-Hua Zhou, Tu Bao Ho, Vincent S. Tseng, Arbee L.P. Chen, James Bailey
Publisher: Springer-Verlag
Pages: 393-403
Number of pages: 11
ISBN (Electronic): 9783319131856
DOIs
State: Published - 2014
Event: International Workshops on Data Mining and Decision Analytics for Public Health, Biologically Inspired Data Mining Techniques, Mobile Data Management, Mining, and Computing on Social Networks, Big Data Science and Engineering on E-Commerce, Cloud Service Discovery, MSMV-MBI, Scalable Data Analytics, Data Mining and Decision Analytics for Public Health and Wellness, Algorithms for Large-Scale Information Processing in Knowledge Discovery, Data Mining in Social Networks, Data Mining in Biomedical Informatics and Healthcare, Pattern Mining and Application of Big Data, in conjunction with the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2014 - Tainan, Taiwan, Province of China
Duration: May 13, 2014 - May 16, 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8643
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: International Workshops on Data Mining and Decision Analytics for Public Health, Biologically Inspired Data Mining Techniques, Mobile Data Management, Mining, and Computing on Social Networks, Big Data Science and Engineering on E-Commerce, Cloud Service Discovery, MSMV-MBI, Scalable Data Analytics, Data Mining and Decision Analytics for Public Health and Wellness, Algorithms for Large-Scale Information Processing in Knowledge Discovery, Data Mining in Social Networks, Data Mining in Biomedical Informatics and Healthcare, Pattern Mining and Application of Big Data, in conjunction with the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2014
Country/Territory: Taiwan, Province of China
City: Tainan
Period: 5/13/14 - 5/16/14

Keywords

  • Classification
  • K-nearest neighbor
  • Semi-supervised learning
  • Synthetic samples

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
