The Data Science Toolbox: Julia


Statistics / Scientific Computing  Julia  2013-12-07 17:05:02

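Julia is designed for high-performance scientific computing. Its execution speed is said to be on par with C, and a common saying in the community is:

Walks like Python. Runs like C.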

Ten-minute cheat sheet

The most basic statements

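Install a package: Pkg.add(""), with the package name between the quotes. As in R, the package is located and downloaded automatically, and its dependencies are installed along with it.

Load a package: using, followed by the package name

Read a CSV file: readtable("yourfilename.csv")

Check the size of the data: size(dataset)

List the column names: names(dataset)

Check the type of a data structure: typeof()

Show the first N rows: head(dataset, N)

Show summary statistics for a numeric column: describe(dataset[:colname]), for example minimum, maximum, quantiles and number of missing values

Show column types and missing-value counts: showcols(dataset)

Note: the DataFrames calls above follow the older API in use when this was written. Some names have changed since; for example, readtable is now CSV.read(file, DataFrame) from the CSV package, and head(dataset, N) is now first(dataset, N).

Append an element to a list: push!(); append another list: append!(); remove the last element: pop!()

Check whether elem is in list: in(elem, list), which returns true or false

Get the keys of a dictionary: keys(dict); its values: values(dict); check whether dict contains the key mykey: haskey(dict, mykey)

Intersection of two sets: intersect(set1, set2); union: union(set1, set2); difference: setdiff(set1, set2)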
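As a quick check of the collection functions listed above, here is a minimal sketch that uses only Julia's standard library. The variable names and sample values are made up for illustration.

```julia
# Lists (Vectors): push!, append!, pop! and membership tests
list = [1, 2, 3]
push!(list, 4)            # list is now [1, 2, 3, 4]
append!(list, [5, 6])     # list is now [1, 2, 3, 4, 5, 6]
last_elem = pop!(list)    # removes and returns the last element, 6
has_three = in(3, list)   # true; can also be written as `3 in list`

# Dictionaries: keys, values, haskey
dict = Dict("one" => 1, "two" => 2)
dict_keys = collect(keys(dict))      # iteration order is not guaranteed
dict_values = collect(values(dict))
has_one = haskey(dict, "one")        # true

# Sets: intersection, union, difference
set1 = Set([1, 2, 3])
set2 = Set([3, 4])
common = intersect(set1, set2)   # Set([3])
either = union(set1, set2)       # Set([1, 2, 3, 4])
only1 = setdiff(set1, set2)      # Set([1, 2])
```

The trailing ! in push!, append! and pop! is a naming convention for functions that modify their argument in place.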

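The most basic data structures

String: a string of characters

Vector: a one-dimensional array. Note that indexing starts at 1, unlike in many other languages such as C or Python.

Set: a collection of unique elements

Dict: a dictionary (key-value map), for example a = Dict("one"=> 1, "two"=> 2, "three"=> 3)

Matrix: a two-dimensional array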
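To make the list above concrete, the following sketch creates one value of each type and reads an element back. The sample values are arbitrary.

```julia
s = "Julia"                  # String
first_char = s[1]            # 'J', because indexing starts at 1

v = [10, 20, 30]             # Vector of integers
first_item = v[1]            # 10
last_item = v[end]           # 30; `end` is the last valid index

st = Set([1, 2, 2, 3])       # duplicates collapse, so 3 elements remain

a = Dict("one" => 1, "two" => 2, "three" => 3)
two = a["two"]               # 2

m = [1 2; 3 4]               # 2x2 Matrix: spaces separate columns, ; separates rows
entry = m[2, 1]              # row 2, column 1, which is 3
```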

The most basic control flow

Conditionals

# Let's make a variable
some_var = 5
# Here is an if statement. Indentation is not meaningful in Julia.
if some_var > 10
    println("some_var is totally bigger than 10.")
elseif some_var < 10    # This elseif clause is optional.
    println("some_var is smaller than 10.")
else                    # The else clause is optional too.
    println("some_var is indeed 10.")
end

Loops

# The loop variable can run over any Julia iterable (a range, an array, a Set, a Dict, ...)
for i in 1:3
    println(i)
end

x = 0
while x < 4
    println(x)
    x += 1 # Shorthand for x = x + 1
end
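Here are a few more loop patterns, sketched with the standard library only. The function names sum_values, label_items and countdown are invented for this illustration. The loops are wrapped in functions because, in a script run on Julia 1.x, assigning to a top-level variable from inside a loop follows different scoping rules than in the REPL.

```julia
# Sum the values of a dictionary; iterating a Dict yields key => value pairs
function sum_values(d)
    total = 0
    for (key, value) in d
        total += value
    end
    return total
end

# Build "index:item" labels; enumerate yields (index, item) pairs starting at 1
function label_items(items)
    labels = String[]
    for (i, item) in enumerate(items)
        push!(labels, "$i:$item")
    end
    return labels
end

# Count down with a while loop, collecting the visited values
function countdown(n)
    visited = Int[]
    while n > 0
        push!(visited, n)
        n -= 1
    end
    return visited
end
```

For example, sum_values(Dict("one" => 1, "two" => 2)) gives 3 and countdown(3) gives [3, 2, 1].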

Error handling

try
    expression
catch e
    println(e)
end
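A concrete version of the try/catch template might look like the sketch below. safe_sqrt is a hypothetical helper written for this example. It relies on the fact that sqrt throws a DomainError for a negative real argument.

```julia
# Return the square root of x, or `nothing` if x is outside the domain of sqrt
function safe_sqrt(x)
    try
        return sqrt(x)
    catch e
        if e isa DomainError
            println("Caught: ", e)
            return nothing
        else
            rethrow()   # pass any unexpected error on to the caller
        end
    end
end
```

safe_sqrt(4.0) returns 2.0, while safe_sqrt(-1.0) prints the caught error and returns nothing.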

Functions and types (classes)

function add(x, y)
    println("x is $x and y is $y")
    # Functions return the value of their last statement
    x + y
end

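# Julia 1.x declares composite types with `mutable struct`; older versions used the `type` keyword
mutable struct Person
    name::AbstractString
    male::Bool
    age::Float64
    children::Int
end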

p = Person("Julia", false, 4, 0)
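The short sketch below shows how the function and the type might be used, assuming Julia 1.x, where composite types are declared with mutable struct. The definitions are repeated (without the println call) so that the snippet runs on its own.

```julia
function add(x, y)
    # Functions return the value of their last statement
    x + y
end

mutable struct Person
    name::AbstractString
    male::Bool
    age::Float64
    children::Int
end

total = add(5, 6)                  # 11

p = Person("Julia", false, 4, 0)   # the default constructor takes the fields in order
p_age = p.age                      # 4.0, because the integer 4 is converted to Float64
p.children += 1                    # fields of a mutable struct can be reassigned
```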

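Commonly used packages for data science

IJulia: the Julia kernel for Jupyter, which lets you work in notebooks

DataFrames: tabular data

Visualization: Gadfly, Plots, StatPlots (since renamed StatsPlots), PyPlot, Winston

PyCall: a real gem. With this package you can import Python libraries directly and use them from Julia!

RDatasets and RCall: also gems. RDatasets provides many of the example datasets that ship with R, and RCall lets you call R and its libraries directly from Julia!

ScikitLearn: yes, you read that right. This package is an interface to Python's sklearn package.

DecisionTree: yes, exactly the decision trees you were looking for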

A hello-world-level example

using CSV
using DataFrames
using DecisionTree

# Load the data. expanduser resolves "~"; CSV.read takes the place of
# readtable from older DataFrames releases and reads the header row by default.
iris = CSV.read(expanduser("~/iris.csv"), DataFrame)

# Check the size of the data
size(iris)

# Look at the first few rows
first(iris, 10)

# Summary statistics of the feature columns
describe(iris[:, 1:4])

# Randomly split the data: one part for training, one part for testing.
# I don't yet know whether there is a ready-made random_split like the one in sklearn,
# so here is an approximate version of our own.
function random_split(inputData, trainFraction)
    # Each row goes to the training set with probability trainFraction
    inTrain = rand(nrow(inputData)) .<= trainFraction
    return inputData[inTrain, :], inputData[.!inTrain, :]
end

train, test = random_split(iris, 0.8)
x_train = Matrix(train[:, 1:4])
y_train = Vector(train[:, 5])
x_test = Matrix(test[:, 1:4])
y_test = Vector(test[:, 5])

# Build the decision tree model
model = build_tree(y_train, x_train)
# Prune the tree somewhat to guard against overfitting
model = prune_tree(model, 0.8)
# Classify the test set
y_predict = apply_tree(model, x_test)

# Compute the prediction accuracy
accuracy = sum(y_predict .== y_test) / length(y_test)
println("Accuracy: ", accuracy)
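The split above assigns each row by an independent random draw, so the training set is only roughly 80% of the data. If an exact proportion is wanted, one possible alternative is to shuffle the row indices and cut them at a fixed position. The sketch below uses only the Random standard library. split_indices and accuracy are illustrative helpers, not functions from DataFrames or DecisionTree.

```julia
using Random

# Split the row indices 1:n into a training part and a test part.
# The training part always has round(Int, fraction * n) rows.
function split_indices(n, fraction; rng = Random.default_rng())
    perm = randperm(rng, n)              # random permutation of 1:n
    n_train = round(Int, fraction * n)
    return perm[1:n_train], perm[n_train+1:end]
end

# Fraction of predictions that match the true labels
function accuracy(y_predict, y_true)
    return sum(y_predict .== y_true) / length(y_true)
end

# A fixed seed makes the split reproducible
train_idx, test_idx = split_indices(150, 0.8; rng = MersenneTwister(42))
```

With a DataFrame such as iris, the two index vectors could then be used as iris[train_idx, :] and iris[test_idx, :].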
