Loading Your Own Data with sklearn

Natural Language Processing  Machine Learning  Python   2013-06-08 21:25:21

The main tool is a function that ships with sklearn:

sklearn.datasets.load_files

Signature:

sklearn.datasets.load_files(container_path, description=None, categories=None,
load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)

container_path: path to the top-level data folder (here "20_newsgroups"), which holds one subfolder per category.
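The other parameters in the signature can be illustrated with a small sketch. It writes a throwaway three-category corpus to a temporary folder (the folder contents and file names are made up for the example), then loads only two of the categories and asks for decoded text instead of raw bytes.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Hypothetical three-category corpus written to a temporary folder.
root = tempfile.mkdtemp()
for category in ("sci.med", "sci.space", "rec.autos"):
    os.makedirs(os.path.join(root, category))
    path = os.path.join(root, category, "0001.txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("a post about " + category)

subset = load_files(
    root,
    categories=["sci.med", "sci.space"],  # load only these subfolders
    encoding="utf-8",                     # decode file contents to str
    decode_error="replace",               # replace undecodable bytes instead of raising
    shuffle=True,
    random_state=0,                       # fixed seed so the shuffle is repeatable
)

print(subset.target_names)  # only the requested categories
print(type(subset.data[0])) # str, because an encoding was given
```

With encoding left at None, as in the signature's default, the entries of data stay as bytes.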

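Return value:

data: the raw data (the contents of each file)
filenames: the path of each file
target: the class labels (integer indices starting at 0)
target_names: what each numeric label means (taken from the subfolder names, e.g. alt.atheism)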
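To make the four fields concrete, here is a minimal sketch on a tiny made-up corpus (the category names "cats" and "dogs" and the file names are hypothetical). shuffle=False is passed so that the order of the results is predictable.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Build a tiny throwaway corpus: one subfolder per category.
root = tempfile.mkdtemp()
samples = {
    "cats": ["cats purr", "cats nap all day"],
    "dogs": ["dogs bark"],
}
for category, texts in samples.items():
    os.makedirs(os.path.join(root, category))
    for i, text in enumerate(texts):
        path = os.path.join(root, category, f"doc{i}.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)

bunch = load_files(root, shuffle=False)

print(bunch.target_names)  # category names taken from the subfolder names
print(bunch.target)        # one integer label per file
print(bunch.filenames)     # path of each loaded file
print(bunch.data[0])       # raw bytes, because encoding was left as None
```

Each position in data, filenames and target refers to the same file, so the three can be indexed together.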


The directory structure is as follows:

20_newsgroups

This directory contains the following subdirectories, one per category:

alt.atheism                 rec.autos             sci.space
comp.graphics               rec.motorcycles       soc.religion.christian
comp.os.ms-windows.misc     rec.sport.baseball    talk.politics.guns
comp.sys.ibm.pc.hardware    rec.sport.hockey      talk.politics.mideast
comp.sys.mac.hardware       sci.crypt             talk.politics.misc
comp.windows.x              sci.electronics       talk.religion.misc
misc.forsale                sci.med


Test code:

from sklearn import datasets

data_folder = "/20_newsgroups"

# Load every file under data_folder; each subfolder name becomes a category
rawData = datasets.load_files(data_folder)
print(rawData)
X = rawData.data
print(X[0])  # contents of the first file
y = rawData.target
print(y)  # integer label of each file

Relevant output:

'target_names': ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc'], 'target': array([ 5, 13, 16, ...,  9, 10,  2]), 'DESCR': None
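Each integer in target is an index into target_names, so a label can be turned back into its folder name with a plain list lookup. The sketch below uses the target_names list exactly as printed above and only the six label values that are visible in the printed array (the elided middle is left out).

```python
# target_names exactly as printed in the output above
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

# Only the label values visible in the printed array.
visible_targets = [5, 13, 16, 9, 10, 2]

# Map each integer label back to its category name.
decoded = [target_names[label] for label in visible_targets]
for label, name in zip(visible_targets, decoded):
    print(label, name)
```

So the first file loaded in that run belongs to comp.windows.x and the last one to comp.os.ms-windows.misc.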


