Project: Improving Data Quality and Data Mining Using Noisy Micro-Outsourcing (NSF IIS-1115417)

PI: Victor S. Sheng (ssheng@uca.edu), University of Central Arkansas

Abstract

Machine learning currently offers one of the most cost-effective approaches to building predictive models (e.g., classifiers for categorizing the millions of messages, news articles, and blogs that are generated every day). However, the effective use of machine learning methods in such settings is limited by the availability of a training corpus (i.e., a representative set of instances that have been labeled with the corresponding categories). In domains where labeled data are scarce or expensive to acquire, there is an urgent need for cost-effective approaches to selectively acquiring labels for the data samples used to train predictive models.

This project explores novel techniques that take advantage of the low cost of micro-outsourcing, using systems such as Amazon's Mechanical Turk to engage a large number of workers from around the world in labeling the instances used to construct the training corpus. There is currently little understanding of how to utilize the multiple noisy labels obtained through micro-outsourcing, and advanced techniques are needed to exploit its low cost in order to improve both data quality and the quality of models built from the available data. The project explores novel approaches for utilizing multiple labels given to an instance by different labelers. It also extends active learning techniques, which select the samples to be labeled, to take into account the multisets of labels already obtained from a pool of labelers.
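
To make the idea of utilizing multiple noisy labels concrete, the following is a minimal sketch, not the project's actual method. It assumes majority voting as the integration rule (a common baseline) and a simple disagreement score for choosing which instance to send out for another label. The function names, the score, and the toy data are all hypothetical and chosen for illustration only.

```python
from collections import Counter


def majority_vote(labels):
    """Integrate a multiset of noisy labels into one label by majority vote.

    Assumption: ties go to the label that appears first in the list.
    """
    return Counter(labels).most_common(1)[0][0]


def label_uncertainty(labels):
    """Fraction of labels that disagree with the majority label.

    0.0 means all labelers agree; values near 0.5 (for two classes)
    mean the labelers are split.
    """
    top_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - top_count / len(labels)


def next_instance_to_label(label_sets):
    """Pick the instance whose current labels disagree the most."""
    return max(label_sets, key=lambda name: label_uncertainty(label_sets[name]))


# Hypothetical labels collected from different workers for three documents.
label_sets = {
    "doc1": ["spam", "spam", "ham"],
    "doc2": ["ham", "ham", "ham"],
    "doc3": ["spam", "ham"],
}

integrated = {name: majority_vote(labels) for name, labels in label_sets.items()}
candidate = next_instance_to_label(label_sets)
```

In this toy data, "doc3" has the most evenly split labels, so it would be the next instance sent out for labeling under this particular score. Other integration rules and selection scores are possible.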

Advances in techniques for active selection of data instances to be labeled in a micro-outsourcing setting can significantly improve the quality of data used to build predictive models in a broad range of applications, including gene annotation, image annotation, text classification, sentiment analysis, and recommender systems, where unlabeled data are plentiful yet labeled data are sparse. The project will provide research opportunities for students at the University of Central Arkansas, a primarily undergraduate institution, and help expand the STEM pipeline.

Publications

  • Ipeirotis, P.G., Provost, F., Sheng, V.S., & Wang, J. (2014). Repeated Labeling Using Multiple Noisy Labelers, Data Mining and Knowledge Discovery, 28(2), 402-441.
  • Gu, B. & Sheng, V.S. (2013). Feasibility and Finite Convergence Analysis for Accurate On-line ν-Support Vector Machine, IEEE Transactions on Neural Networks and Learning Systems, Volume 24, Issue 8, 1304-1315.
  • Wu, J., Cui, Z., Sheng, V.S., Zhao, P., Su, D., Gong, S. (2013). A Comparative Study of SIFT and its Variants, Measurement Science Review, Volume 13, No. 3, 122-131.
  • Su, D., Wu, J., Cui, Z., Sheng, V.S., Gong, S. (2013). CGCI-SIFT: A More Efficient and Compact Representation of Local Descriptor, Measurement Science Review, Volume 13, No. 3, 132-141.
  • Zhang, J., Wu, X., & Sheng, V.S., (2013). A Threshold Method for Imbalanced Multiple Noisy Labeling, The 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013), 61-65, Niagara Falls, Canada, August 25-28.
  • Tawiah, C. & Sheng, V.S., (2013). Empirical Comparison of Multi-Label Classification Algorithms, Proceedings of the 27th National Conference on Artificial Intelligence (AAAI13), 1645-1646, July 14-18, Bellevue, Washington.
  • Eichelberger, R.K. & Sheng, V.S., (2013). Does One-Against-All or One-Against-One Improve the Performance of Multiclass Classifications, Proceedings of the 27th National Conference on Artificial Intelligence (AAAI13), 1609-1610, July 14-18, Bellevue, Washington.
  • Zhang, J., Wu, X., & Sheng, V.S., (2013). Imbalanced Multiple Noisy Labeling for Supervised Learning, Proceedings of the 27th National Conference on Artificial Intelligence (AAAI13), 1651-1652, July 14-18, Bellevue, Washington.
  • Tawiah, C. & Sheng, V.S., (2013). A Study on Multi-Label Classification, Proceedings of the 13th Industrial Conference on Data Mining (ICDM13), 137-150, July 16-21, New York.
  • Eichelberger, R.K. & Sheng, V.S., (2013). Empirical Study of Reducing Multiclass Classification Methodologies, Proceedings of the 9th International Conference on Machine Learning and Data Mining (MLDM13), 505-519, July 19-25, New York.
  • Sheng, V.S. (2012). Studying Active Learning in the Cost-sensitive Framework, Proceedings of the 45th Hawaii International Conference on System Sciences (HICSS-45), 1097-1106, January 4-7, Grand Wailea, Maui, Hawaii, USA.
  • Nordin, B., Hu, C., Chen, B., Sheng, V.S. (2012). Interval-Valued Centroids in K-Means Algorithms, Proceedings of the 11th International Conference on Machine Learning and Applications (ICMLA), 478-481, December 12-15, Boca Raton, Florida, USA.
  • Sheng, V.S. (2011). Simple Multiple Noisy Label Utilization Strategies, Proceedings of the 2011 IEEE International Conference on Data Mining (ICDM11), 635-644, December 11-14, Vancouver, Canada. (Regular paper acceptance rate: 12.3%).
  • Sheng, V.S. (2011). Example Labeling Difficulty within Repeated Labeling, Proceedings of the 7th International Conference on Data Mining (DMIN11), 301-307, August 18-21, Las Vegas, Nevada, USA. (Acceptance rate: 24%).
  • Sheng, V.S., Tada, R., Atla, A. (2011). An Empirical Study of Noise Impacts on Supervised Learning Algorithms and Measures, Proceedings of the 7th International Conference on Data Mining (DMIN11), 266-272, August 18-21, Las Vegas, Nevada, USA. (Acceptance rate: 24%).
  • Sheng, V.S. (2011). Fast Data Acquisition in Cost-Sensitive Learning, Proceedings of the 11th Industrial Conference on Data Mining (ICDM11), 66-77, New York. (Best Paper Award). (Acceptance rate: less than 24%).
  • Sheng, V.S., Tada, R. (2011). Boosting Inspired Process for Improving AUC, Proceedings of the 7th International Conference on Machine Learning and Data Mining (MLDM11), 199-209, New York. (Acceptance rate: less than 26%).

Join the Project

Please send your resume, transcripts, and a personal statement that includes your schedule to Dr. Sheng at ssheng@uca.edu. You are also welcome to drop by his office, MSCT313, to discuss this project. We will train you first and pay you based on your contributions.