A Prosecution Decision Support and Cases Files Classification Management System Based on Text Mining

Management Review, 2022, Vol. 34, Issue (6): 143-152.

Shi Yong^1,2,3,4, An Wenlu^1,5, Qu Yi^1,2,3

1. School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190;
2. Research Center on Fictitious Economy & Data Science, Chinese Academy of Sciences, Beijing 100190;
3. Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190;
4. College of Information Science and Technology, University of Nebraska at Omaha, NE 68182, USA;
5. People's Procuratorate of Shanghai Pudong New District, Shanghai 200135

Received: 2020-09-29 Online: 2022-06-28 Published: 2022-07-22
Corresponding author: Qu Yi, PhD candidate at the School of Economics and Management, University of Chinese Academy of Sciences; the Research Center on Fictitious Economy & Data Science, Chinese Academy of Sciences; and the Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences.
About the authors: Shi Yong is a professor at the School of Economics and Management, University of Chinese Academy of Sciences; director of the Research Center on Fictitious Economy & Data Science, Chinese Academy of Sciences; director of the Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences; Distinguished Chair Professor at the College of Information Science and Technology, University of Nebraska at Omaha; and a doctoral supervisor. An Wenlu is a PhD candidate at the School of Economics and Management, University of Chinese Academy of Sciences, and deputy chief procurator of the People's Procuratorate of Shanghai Pudong New District.
Funding:
Key Program of the National Natural Science Foundation of China (71932008).

Abstract

Abstract: The construction of "smart prosecution" has made great progress in recent years. However, most of this work has concentrated on prosecution informatization and data infrastructure, while decision support for prosecutorial work has received little attention and produced few research results. To address this weakness, this paper focuses on the core prosecutorial task of "bringing public prosecution against criminal offenses" and on the decision process in which prosecutors "decide which charge to prosecute according to the basic case information", and uses text mining techniques to build a prosecution decision support system. The system consists mainly of text preprocessing, feature extraction, and classification; its input is the text of a case description and its output is the corresponding prosecution charge. Experimental results show that the system achieves high accuracy across multiple classification models, different numbers of features, and different text vector representation methods. It not only provides effective, high-precision prosecution decision support but also improves the efficiency of case file classification management. This work is a pioneering attempt at using big data mining to assist prosecutorial decision making and a concrete practice in raising the level of intelligence in prosecutorial work. It enriches research in the field, and its data and conclusions can serve as a baseline for future application and practice.

Key words: text mining, text classification, decision support in prosecution, case file classification and management
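
The pipeline summarized in the abstract (text preprocessing, feature extraction, classification, with a case description as input and a charge as output) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, since the abstract does not name specific methods: the stop-word list, document-frequency feature selection, TF-IDF weighting, and a nearest-centroid classifier are simple stand-ins, and the English case descriptions and the two charge labels are invented toy data. A Chinese-language system would also need a word segmentation step, which this sketch omits.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real system would use a curated one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "from", "in", "by", "with"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def select_features(docs, k):
    """Keep the k terms with the highest document frequency (ties broken alphabetically)."""
    df = Counter(term for doc in docs for term in set(doc))
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return ranked[:k], df


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class ChargeClassifier:
    """Nearest-centroid classifier over TF-IDF vectors (illustrative stand-in)."""

    def __init__(self, n_features=20):
        self.n_features = n_features

    def _vectorize(self, doc):
        # Term frequency times a smoothed inverse document frequency.
        tf = Counter(doc)
        return [
            tf[t] * (math.log((1 + self.n_docs) / (1 + self.df[t])) + 1.0)
            for t in self.vocab
        ]

    def fit(self, texts, charges):
        docs = [preprocess(t) for t in texts]
        self.n_docs = len(docs)
        self.vocab, self.df = select_features(docs, self.n_features)
        sums = defaultdict(lambda: [0.0] * len(self.vocab))
        counts = Counter(charges)
        for doc, charge in zip(docs, charges):
            vec = self._vectorize(doc)
            sums[charge] = [s + x for s, x in zip(sums[charge], vec)]
        # One centroid per charge: the mean TF-IDF vector of its training cases.
        self.centroids = {
            c: [s / counts[c] for s in total] for c, total in sums.items()
        }
        return self

    def predict(self, text):
        vec = self._vectorize(preprocess(text))
        # Iterate charges in sorted order so ties resolve deterministically.
        return max(sorted(self.centroids), key=lambda c: cosine(vec, self.centroids[c]))


# Invented toy data, not taken from the paper.
train_texts = [
    "The suspect stole a wallet from a passenger on the subway",
    "The suspect broke into a shop at night and stole cash",
    "The suspect stole a bicycle parked outside a mall",
    "The suspect deceived investors with a fake contract and took their money",
    "The suspect posed as a bank clerk and deceived an elderly victim",
    "The suspect sold fake insurance and deceived several buyers",
]
train_charges = ["theft"] * 3 + ["fraud"] * 3

clf = ChargeClassifier(n_features=20).fit(train_texts, train_charges)
print(clf.predict("He stole a phone from a parked car"))
print(clf.predict("She deceived a customer with a fake invoice"))
```

Changing `n_features` or swapping the classifier behind the same `fit`/`predict` interface is one way to mimic, at toy scale, the comparison across feature counts and classification models that the abstract reports.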

Shi Yong, An Wenlu, Qu Yi. A Prosecution Decision Support and Cases Files Classification Management System Based on Text Mining[J]. Management Review, 2022, 34(6): 143-152.
