The main tool is a function that ships with scikit-learn:
sklearn.datasets.load_files
Signature:
sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
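container_path: the path to the "20_newsgroups" folder.
Return value (a dict-like object with the following fields):
data: the raw content of each file (bytes unless encoding is set)
filenames: the path of each file
target: the class labels (integer indices starting from 0)
target_names: the meaning of each numeric label, taken from the subfolder names (for example, alt.atheism)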
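As a minimal sketch of how these fields relate to each other, the example below builds a tiny throwaway corpus in a temporary directory and loads it. The helper build_toy_corpus, the file names, and the texts are made up for illustration; only load_files and the field names come from the description above. Passing shuffle=False keeps the files in sorted order so the result is easy to read.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files


def build_toy_corpus(root):
    """Hypothetical helper: create two category folders with a few text files."""
    docs = {
        "alt.atheism": {"0001": "first post"},
        "sci.space": {"0002": "rocket launch", "0003": "orbit notes"},
    }
    for category, files in docs.items():
        folder = Path(root) / category
        folder.mkdir(parents=True)
        for name, text in files.items():
            (folder / name).write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as root:
    build_toy_corpus(root)
    # encoding="utf-8" decodes each file to str; shuffle=False keeps sorted order
    bunch = load_files(root, encoding="utf-8", shuffle=False)

# Each integer in target is an index into target_names
labels = [bunch.target_names[i] for i in bunch.target]
for text, label in zip(bunch.data, labels):
    print(label, "->", text)
```

Here target holds one integer per file, and indexing target_names with it recovers the subfolder name the file came from.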
The directory structure is as follows:
20_newsgroups
This directory contains the following subdirectories, one per category:
alt.atheism rec.autos sci.space
comp.graphics rec.motorcycles soc.religion.christian
comp.os.ms-windows.misc rec.sport.baseball talk.politics.guns
comp.sys.ibm.pc.hardware rec.sport.hockey talk.politics.mideast
comp.sys.mac.hardware sci.crypt talk.politics.misc
comp.windows.x sci.electronics talk.religion.misc
misc.forsale sci.med
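The integer labels follow the alphabetical order of these subfolder names, which matches the target_names list in the output. The sketch below reproduces that mapping with the standard library only. It is an illustration of the idea, not scikit-learn's own code, and folder_label_map is a hypothetical helper name.

```python
import os
import tempfile


def folder_label_map(container_path):
    """Hypothetical helper: map each subfolder name to its index in sorted order."""
    folders = sorted(
        name
        for name in os.listdir(container_path)
        if os.path.isdir(os.path.join(container_path, name))
    )
    return {name: index for index, name in enumerate(folders)}


with tempfile.TemporaryDirectory() as root:
    # Create three empty category folders, deliberately out of order
    for name in ["sci.space", "alt.atheism", "comp.graphics"]:
        os.mkdir(os.path.join(root, name))
    mapping = folder_label_map(root)

print(mapping)
```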
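Test code:
from sklearn import datasets

# Path to the extracted 20_newsgroups folder
data_folder = "/20_newsgroups"

# Load every file; each subfolder name becomes one category
rawData = datasets.load_files(data_folder)
print(rawData)

X = rawData.data
print(X[0])  # content of the first file

y = rawData.target
print(y)  # integer label of every file
Relevant output (excerpt):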
'target_names': ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc'], 'target': array([ 5, 13, 16, ..., 9, 10, 2]), 'DESCR': None