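Educational data mining (EDM) is a frontier research area in contemporary educational informatization and is attracting growing attention from education researchers and data scientists. In the "big data" era, as the volume of data to be processed keeps surging, existing data mining models hit the computing limits of a single processing node, and a variety of distributed computing frameworks for big data processing have emerged in response. With these frameworks, machine learning models for mining university graduate employment data can meet future large-scale data processing needs and support data mining and decision-making in future information integration systems with very large datasets. Against this background, this study compares the classification performance of existing data models on the target data and proposes an improved random forest model that introduces weighting coefficients for the input features when computing each feature's information gain, which serves as the criterion for choosing the optimal split feature, in order to improve classification performance. Simulation tests measure how much the improved model raises classification performance over the existing models. In addition, to address the single-node performance bottleneck in classifying massive data, the study proposes a classification and prediction model for large-scale student employment data based on a distributed version of the improved random forest algorithm. The MapReduce distributed computing framework is used to implement serialized writing and deserialized loading of the trained model between the local disk and the distributed file system, thereby extending the improved random forest classification model to distributed large-scale data classification.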
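The abstract does not give the modified information gain equation, so the following is only a minimal sketch of the general idea. It assumes the simplest possible form, in which the standard information gain of a categorical feature is multiplied by a per-feature weight before the best split is chosen. The function names, the weight values, and the toy records are all hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, feature):
    """Standard information gain of splitting on one categorical feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def weighted_gain(rows, labels, feature, weights):
    """ASSUMED form: gain scaled by a per-feature weight (default 1.0)."""
    return weights.get(feature, 1.0) * information_gain(rows, labels, feature)


def best_split_feature(rows, labels, features, weights):
    """Pick the feature with the largest weighted gain."""
    return max(features, key=lambda f: weighted_gain(rows, labels, f, weights))


# Hypothetical toy records, for illustration only.
rows = [
    {"major": "cs", "gpa": "high"},
    {"major": "cs", "gpa": "low"},
    {"major": "art", "gpa": "high"},
    {"major": "art", "gpa": "low"},
]
labels = ["employed", "employed", "unemployed", "unemployed"]
weights = {"major": 0.5, "gpa": 1.0}  # hypothetical weighting coefficients

chosen = best_split_feature(rows, labels, ["major", "gpa"], weights)
```

In this toy data `major` separates the classes perfectly and `gpa` carries no information, so `major` is still chosen even with its weight halved. With less clear-cut data the weights could change which feature wins, which is the effect a weighted criterion is meant to have.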
Abstract
Educational data mining (EDM) is a research area that applies data mining technology to education. In EDM research, data mining techniques are used to model data samples from the field of education, with the aim of learning from the data and making predictions on test datasets using effective statistical machine learning models. Combined with distributed computing frameworks, machine learning models in EDM can meet the needs of large-scale data processing while providing tailored data recommendations and supporting decision-making in the future. Against this background, this study first trains and tests a range of data models in simulation, and then proposes an improved model that raises classification performance by adjusting the data model and by using an improved algorithm based on a new equation for information gain when selecting the optimal feature to split on. Building on the best-performing data model from that comparison and on the application background of the "big data" era, we propose a new random forest model for classifying large-scale datasets, built on the MapReduce distributed computing framework. Using MapReduce, we design and implement a system that meets this requirement. In this system, a trained model can be serialized and deserialized between local disks and the distributed file system.
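The serialize/deserialize round trip mentioned at the end of the abstract can be illustrated with a small sketch. This is not the system described in the study: it uses a local temporary file as a stand-in for both the local disk and the distributed file system, JSON as an assumed storage format, and a hypothetical toy "forest" made of one-feature voting rules. No MapReduce or distributed file system API is used or implied.

```python
import json
import os
import tempfile


def predict(model, row):
    """Majority vote over the toy rules stored in the model dict."""
    votes = [
        rule["if_equal"] if row.get(rule["feature"]) == rule["value"] else rule["otherwise"]
        for rule in model["trees"]
    ]
    return max(set(votes), key=votes.count)


def save_model(model, path):
    """Serialize the trained model to a file (stand-in for a DFS write)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(model, fh)


def load_model(path):
    """Deserialize a previously saved model (stand-in for a DFS read)."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical "trained" model: three single-rule trees.
model = {
    "trees": [
        {"feature": "major", "value": "cs", "if_equal": "employed", "otherwise": "unemployed"},
        {"feature": "gpa", "value": "high", "if_equal": "employed", "otherwise": "unemployed"},
        {"feature": "major", "value": "cs", "if_equal": "employed", "otherwise": "unemployed"},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "forest.json")
    save_model(model, path)
    restored = load_model(path)

prediction = predict(restored, {"major": "cs", "gpa": "low"})
```

The point of the round trip is that a model trained once can be written out, shipped to worker nodes, and loaded there for prediction without retraining; the restored model here gives the same votes as the original.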
Keywords
machine learning /
data classification model /
big data processing /
MapReduce
Key words
machine learning /
data classification model /
big data processing /
MapReduce
CLC number:
TP301.6
Funding
National Natural Science Foundation of China (71690234)