julia 提供處理大量資料,以及進行統計運算的 library
Working with data
首先是讀取資料的方式,如果是 binary 資料,可直接用 read(), write()
julia> for val in [:Bool, :Char, :Int8]
         @eval println(@which read(stdin, $val))
       end
read(s::IO, ::Type{Bool}) in Base at io.jl:593
read(io::IO, ::Type{Char}) in Base at io.jl:613
read(s::IO, ::Type{Int8}) in Base at io.jl:588使用
julia> read(stdin, Char)
j
'j': ASCII/Unicode U+006a (category Ll: Letter, lowercase)
julia> read(stdin, Bool)
true
true
julia> read(stdin, Int8)
12
49readline
julia> methods(readline)
# 4 methods for generic function "readline":
[1] readline(s::IOStream; keep) in Base at iostream.jl:433
[2] readline() in Base at io.jl:370
[3] readline(filename::AbstractString; keep) in Base at io.jl:364
[4] readline(s::IO; keep) in Base at io.jl:370
使用
julia> number = parse(Int64, readline())
12
12
julia> println(number)
12working with text files
open 參數除了 filename,後面還有存取該檔案的 mode
julia> methods(open)
# 8 methods for generic function "open":
[1] open(fname::AbstractString; read, write, create, truncate, append) in Base at iostream.jl:275
[2] open(fname::AbstractString, mode::AbstractString) in Base at iostream.jl:339
[3] open(f::Function, cmds::Base.AbstractCmd, args...) in Base at process.jl:615
[4] open(f::Function, args...; kwargs...) in Base at iostream.jl:367
[5] open(cmds::Base.AbstractCmd) in Base at process.jl:584
[6] open(cmds::Base.AbstractCmd, mode::AbstractString) in Base at process.jl:562
[7] open(cmds::Base.AbstractCmd, mode::AbstractString, other::Union{RawFD, FileRedirect, IO}) in Base at process.jl:562
[8] open(cmds::Base.AbstractCmd, other::Union{RawFD, FileRedirect, IO}; write, read) in Base at process.jl:584
| Mode | Description | 
|---|---|
| r | read | 
| r+ | read, write | 
| w | write, create, truncate | 
| w+ | read, write, create, truncate | 
| a | write, create, append | 
| a+ | read, write, create, append | 
# 有個文字檔 sample.txt
shell> cat sample.txt
file = open("sample.txt")
# 以行為單位,讀入文字檔
file_data = readlines(file)
enumerate(file_data)
# 加上行號
for lines in enumerate(file_data)
   println(lines[1],"-> ", lines[2])
end
# 大寫列印
for line in file_data
   println(uppercase(line))
end
# 反過來列印
for line in file_data
   println(reverse(line))
end
# 計算行數
countlines("sample.txt")
# 列印第一行
first(file_data)
# 最後一行
last(file_data)working with CSV
using DelimitedFiles
csvfile = readdlm("sample.csv", ',', String, '\n')
# 只取得第二欄資料
csvfile[:,2]
# 只取前三行資料
csvfile[1:3,:]
# 以 | 區隔資料的 sample.psv
# "Lora"|"UK"
# "James"|"UK"
readdlm("sample.psv",'|')working with DataFrames
DataFrame 是一種類似 database 的資料結構,可以用類似 SQL 的方式存取 DataFrames
using DataFrames
# 取得所有 DataFrames 支援的 functions
names(DataFrames)DataFrame: 2D data structure,類似 R 與 Pandas 的 DataFrame
- DataArray: 比 julia 預設 Array type 提供更多 comparison 的功能,但目前已經 deprecated (ref: https://github.com/JuliaStats/DataArrays.jl)
- DataFrame: 2D data structure,類似 R 與 Pandas 的 DataFrame
julia 的 Array 以 nothing 代表 missing value
julia> a = [1,2,3,nothing,5,6]
6-element Array{Union{Nothing, Int64},1}:
 1
 2
 3
  nothing
 5
 6
julia> typeof(a[4])
NothingDataFrame: https://juliadata.github.io/DataFrames.jl/stable/man/getting_started.html
julia> dframe = DataFrame(Names = ["John","May"], Age = [27,28])
2×2 DataFrame
│ Row │ Names  │ Age   │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ John   │ 27    │
│ 2   │ May    │ 28    │
julia> dframe.Names
2-element Array{String,1}:
 "John"
 "May"
julia> dframe.Age
2-element Array{Int64,1}:
 27
 28
julia> names(dframe)
2-element Array{Symbol,1}:
 :Names
 :Age
julia> describe(dframe)
2×8 DataFrame
│ Row │ variable │ mean   │ min  │ median │ max │ nunique │ nmissing │ eltype   │
│     │ Symbol   │ Union… │ Any  │ Union… │ Any │ Union…  │ Nothing  │ DataType │
├─────┼──────────┼────────┼──────┼────────┼─────┼─────────┼──────────┼──────────┤
│ 1   │ Names    │        │ John │        │ May │ 2       │          │ String   │
│ 2   │ Age      │ 27.5   │ 27   │ 27.5   │ 28  │         │          │ Int64    │另外有個獨立的 CSV 套件,可處理 csv file,讀入 CSV file 後,就得到 DataFrame 物件
using Pkg
Pkg.add("CSV")
using CSVjulia> CSV.read("sample.csv")
3×2 DataFrame
│ Row │ Lora    │ UK      │
│     │ String⍰ │ String⍰ │
├─────┼─────────┼─────────┤
│ 1   │ James   │ UK      │
│ 2   │ Raj     │ India   │
│ 3   │ May     │ America │Linear algebra
產生 matrix
# 亂數 3x3 matrix
julia> A = rand(3,3)
3×3 Array{Float64,2}:
 0.108282  0.433033  0.247145
 0.571936  0.369386  0.547845
 0.423382  0.380503  0.77661
# array 且初始為 1
julia> ones(5)
5-element Array{Float64,1}:
 1.0
 1.0
 1.0
 1.0
 1.0Vector 與 Matrix 的差異
Vector{T} 是 Array{T, 1} 的 alias,Matrix{T} 是Array{T, 2} 的 alias
julia> Vector{Float64} == Array{Float64,1}
true
julia> Matrix{Float64} == Array{Float64,2}
true矩陣乘法
julia> A = rand(3,3)
3×3 Array{Float64,2}:
 0.167005  0.188954  0.0483164
 0.82603   0.440255  0.00107621
 0.137211  0.308508  0.724953
julia> b = 2*A
3×3 Array{Float64,2}:
 0.334011  0.377909  0.0966328
 1.65206   0.88051   0.00215241
 0.274422  0.617015  1.44991
julia> b= 2A
3×3 Array{Float64,2}:
 0.334011  0.377909  0.0966328
 1.65206   0.88051   0.00215241
 0.274422  0.617015  1.44991transpose 轉置矩陣
julia> transpose_of_A = A'
3×3 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 0.167005   0.82603     0.137211
 0.188954   0.440255    0.308508
 0.0483164  0.00107621  0.724953inverse 逆矩陣
julia> inv(A)
3×3 Array{Float64,2}:
 -6.31559   2.41816    0.417329
 11.8591   -2.26692   -0.787013
 -3.85134   0.507016   1.63533Determinant 行列式
julia> using LinearAlgebra
julia> det(A)
-0.05048337143385562Eigenvalues 特徵向量
julia> eigvals(A)
3-element Array{Float64,1}:
 -0.1016764648907107
  0.5846559789763167
  0.8492342814186749方程式運算
julia> 5
5
julia> equation = 3x^2 + 4x + 3
98Differential calculus
Pkg.add("Calculus")
using Calculus
names(Calculus)julia> f(x) = sin(x)
f (generic function with 1 method)
        
julia> f'(1.0)
0.5403023058631036
julia> f'(1.0) - cos(1.0)
-5.036193684304635e-12
julia> f''(1.0) - (-sin(1.0))
-6.647716624952338e-7
julia> differentiate("cos(x) + sin(x) + exp(-x) * cos(x)", :x)
:(1 * -(sin(x)) + 1 * cos(x) + ((-1 * exp(-x)) * cos(x) + exp(-x) * (1 * -(sin(x)))))
julia> differentiate("sin(x)", :x)
:(1 * cos(x))
julia> differentiate("cos(x) + sin(y) + exp(-x) * cos(y)", [:x, :y])
2-element Array{Any,1}:
 :(1 * -(sin(x)) + ((-1 * exp(-x)) * cos(y) + exp(-x) * 0))
 :(1 * cos(y) + (0 * cos(y) + exp(-x) * (1 * -(sin(y)))))Statistics
JuliaStats 所列出支援的 packages
- DataFrames
- Distributions: probability distribution
- MultivariateStats: Multivariate statistical analysis
- HypothesisTests
- MLBase: Swiss knife for machine learning
- Distances: Various distances between vectors
- KernelDensity
- Clustering: algorithm for data clustering
- GLM: Generalized linear models
- NMF: Nonnegative matrix factorization
- RegERMs: Regularized empirical risk minimization
- MCMC: Markov Chain Monte Carlo
- TimeSeries: time series analysis
simple statistics
julia> using Statistics
julia> x = [10,20,30,40,50]
julia> mean(x)
30.0
# 中位數 median
julia> median(x)
30.0
julia> sum(x)
150
# 標準差
julia> std(x)
15.811388300841896
# variance 變異數
julia> var(x)
250.0julia 也支援 cumulative operations 的 functions
- accumulate
- cumsum(): cumulative summation
- cumprod(): cumulative product
julia> accumulate(+, x)
5-element Array{Int64,1}:
  10
  30
  60
 100
 150
julia> accumulate(-, x)
5-element Array{Int64,1}:
   10
  -10
  -40
  -80
 -130
julia> cumsum(x)
5-element Array{Int64,1}:
  10
  30
  60
 100
 150
julia> cumprod(x)
5-element Array{Int64,1}:
       10
      200
     6000
   240000
 12000000
julia> for i in [:cumsum,:cumprod]
               @eval print($i, "->")
               @eval println($i(x))
       end
cumsum->[10, 30, 60, 100, 150]
cumprod->[10, 200, 6000, 240000, 12000000]statistics using DataFrames
julia> using DataFrames
julia> dframe = DataFrame(Subjects = ["Maths","Physics","Chemistry"],Marks = [90,85,95])
3×2 DataFrame
│ Row │ Subjects  │ Marks │
│     │ String    │ Int64 │
├─────┼───────────┼───────┤
│ 1   │ Maths     │ 90    │
│ 2   │ Physics   │ 85    │
│ 3   │ Chemistry │ 95    │
julia> describe(dframe)
2×8 DataFrame
│ Row │ variable │ mean   │ min       │ median │ max     │ nunique │ nmissing │ eltype   │
│     │ Symbol   │ Union… │ Any       │ Union… │ Any     │ Union…  │ Nothing  │ DataType │
├─────┼──────────┼────────┼───────────┼────────┼─────────┼─────────┼──────────┼──────────┤
│ 1   │ Subjects │        │ Chemistry │        │ Physics │ 3       │          │ String   │
│ 2   │ Marks    │ 90.0   │ 85        │ 90.0   │ 95      │         │          │ Int64    │Pandas
對習慣使用 python 進行資料分析的使用者來說,Pandas 是比較常用的 package
julia> pandasDataframe = Pandas.DataFrame(Dict(:Subjects => ["Maths","Physics","Chemistry"],:Marks => [90,85,95]))
   Marks   Subjects
0     90      Maths
1     85    Physics
2     95  Chemistry
julia> Pandas.describe(pandasDataframe)
       Marks
count    3.0
mean    90.0
std      5.0
min     85.0
25%     87.5
50%     90.0
75%     92.5
max     95.0
julia> pandasDataframe[:Subjects]
0        Maths
1      Physics
2    Chemistry
Name: Subjects, dtype: object
julia> pandasDataframe[:Marks]
0    90
1    85
2    95
Name: Marks, dtype: int64
julia> query(pandasDataframe,:(Marks>90))
   Marks   Subjects
2     95  ChemistryDistributions
distributions.jl 有以下功能
- Moments (ex: mean, variance, skewness, and kurtosis), entropy, and other properties
- Probability density/mass functions (pdf) and their logarithms (logpdf)
- Moment generating functions and characteristic functions
- Maximum likelihood estimation
- Posterior w.r.t. conjugate prior and Maximum-A-Posteriori (MAP) estimation
using Pkg
Pkg.add("Distributions")
using Distributions
distribution = Normal()
# 以 normal distribution 產生 10 筆 random 資料
x = rand(distribution, 10)
Binomial()
Cauchy()
Poisson()TimeSeries
Pkg.add("TimeSeries")
Pkg.add("MarketData")
using TimeSeriesjulia> dates  = collect(Date(2017,8,1):Day(1):Date(2017,8,5))
5-element Array{Date,1}:
 2017-08-01
 2017-08-02
 2017-08-03
 2017-08-04
 2017-08-05
julia> sample_time = TimeArray(dates, rand(length(dates)))
5×1 TimeArray{Float64,1,Date,Array{Float64,1}} 2017-08-01 to 2017-08-05
│            │ A      │
├────────────┼────────┤
│ 2017-08-01 │ 0.9142 │
│ 2017-08-02 │ 0.0834 │
│ 2017-08-03 │ 0.417  │
│ 2017-08-04 │ 0.4778 │
│ 2017-08-05 │ 0.6859 │Hypothesis testing
測試 sample data 是否有足夠的 evidence,針對整個 population of data 進行條件測試。
有兩種 hypothesis testing 測試方式
- Null hypothesis: the statement being tested
- Alternative hypothesis: the statement that you will be able to conclude as true
Pkg.add("HypothesisTests")
using HypothesisTests
using Distributionsjulia> sampleOne = rand(Normal(), 10)
10-element Array{Float64,1}:
 -0.5347603532683008
 -0.7882658160278007
 -0.3314222340035204
 -0.9280782524368908
  0.4540819395751322
  1.056554282551302
  1.2211198338710754
 -0.7067184060685551
  0.9788402691232629
  0.39421862072760827
julia> testOne = OneSampleTTest(sampleOne)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          0.0815569884043313
    95% confidence interval: (-0.514, 0.6771)
Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.7638
Details:
    number of observations:   10
    t-statistic:              0.3097695342286937
    degrees of freedom:       9
    empirical standard error: 0.2632827937950805
julia> @which OneSampleTTest(sampleOne)
OneSampleTTest(v::AbstractArray{T,1}) where T<:Real in HypothesisTests at /Users/charley/.julia/packages/HypothesisTests/M3Ysg/src/t.jl:98
julia> pvalue(testOne)
0.7637885831202397
julia> pvalue(testOne, tail=:right)
0.38189429156011984
julia> pvalue(testOne, tail=:left)
0.6181057084398802
# check whether 25 successes from 1000 samples is inconsistent with a 50% success rate
julia> BinomialTest(25, 1000, 0.50)
Binomial test
-------------
Population details:
    parameter of interest:   Probability of success
    value under h_0:         0.5
    point estimate:          0.025
    95% confidence interval: (0.0162, 0.0367)
Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99
Details:
    number of observations: 1000
    number of successes:    25
Optimization
optimization 是 finding best solution from all feasible solutions 的問題,可分為兩類
- continuous optimization problem
- combinatorial optimization problem (discrete)
某些問題可以用 optimization 方法解決
- Shortest path
- Maximum flow through a network
- Vehicle routing
juila 將 optimization packages 集合在 JuliaOpt,其中有兩個最重要的 packages,這兩個都是 AML(Algebraic modeling languages),放在 MathProgBase.jl 裡面
- JuMP (Julia for Mathematical Programming)
- Convex.jl
JuMP
Python 習慣使用 PuLP
Pkg.add("JuMP")
Pkg.add("Clp")
using JuMP
using Clpjulia> m = JuMP.Model()
Feasibility problem with:
 * 0 linear constraints
 * 0 variables
Solver is default solver
julia> m = JuMP.Model(solver = Clp.ClpSolver())
Feasibility problem with:
 * 0 linear constraints
 * 0 variables
Solver is Clpoptimiser.jl
using JuMP
using Clp
m = Model(solver = ClpSolver())
@variable(m, 0 <= a <= 2 )
@variable(m, 0 <= b <= 10 )
@objective(m, Max, 5a + 3*b )
@constraint(m, 1a + 5b <= 3.0 )
print(m)
status = solve(m)
println("Objective value: ", getobjectivevalue(m))
println("a = ", getvalue(a))
println("b = ", getvalue(b))執行
$ julia optimiser.jl
Max 5 a + 3 b
Subject to
 a + 5 b ≤ 3
 0 ≤ a ≤ 2
 0 ≤ b ≤ 10
Objective value: 10.6
a = 2.0
b = 0.2Convex.jl
用在 disciplined convex programming,使用 Convex 需要 solver,也就是 SCS
Pkg.add("Convex")
Pkg.add("SCS")
using Convex
using SCS
X = Variable(2, 2)
y = Variable()
p = minimize(vecnorm(X) + y, 2 * X <= 1, X' + y >= 1, X >= 0, y >= 0)
solve!(p)
println(X.value)
println(y.value)
p.optval 
沒有留言:
張貼留言