Reference
Contents
Algorithms
DPMM.fit
— Function.fit(X::AbstractMatrix; algorithm=DEFAULT_ALGO, ncpu=1, T=3000, benchmark=false, scene=nothing, o...)
fit is the main function of DPMM.jl. It clusters a given data matrix whose columns are data points.
The output is the label of each data point.
The default clustering algorithm is SplitMergeAlgorithm.
Keywords:
- ncpu=1 : the number of parallel workers
- T=3000 : iteration count
- benchmark=false : if true, returns elapsed time
- scene=nothing : plot scene for visualization; see setup_scene
- o... : other keyword arguments specific to the algorithm
DPMM.DPMMAlgorithm
— Type.DPMMAlgorithm{P}
Abstract base class for algorithms
P stands for parallel.
Each subtype should provide the following methods:
- AlgoType(X::AbstractMatrix; o...) : constructor
- random_labels(X::AbstractMatrix,algo::AlgoType{P}) where P : random label generator
- create_clusters(X::AbstractMatrix,algo::AlgoType{P},labels) where P : initial clusters
- empty_cluster(algo::AlgoType) where P : an empty cluster (may be nothing)
- run!(algo::AlgoType{P}, X, labels, clusters, emptycluster;o...) where P : run! modifies labels
Other generic functions are implemented on top of these core functions.
DPMM.CollapsedAlgorithm
— Type.CollapsedAlgorithm{P,Q} <: DPMMAlgorithm{P}
Run it by:
labels = fit(X; algorithm = CollapsedAlgorithm, quasi=false, ncpu=1, T=1000, keywords...)
P stands for parallel, Q stands for quasi. The quasi algorithm updates the clusters only at the end of each iteration. The parallel algorithm is valid for the quasi-collapsed algorithm only. The number of workers can be passed via the ncpu keyword argument to the fit or run! functions.
Provides the following methods:
CollapsedAlgorithm(X::AbstractMatrix{T}; modelType=_default_model(T), α=1, ninit=1, parallel=false, quasi=false, o...)
random_labels(X::AbstractMatrix, algo::CollapsedAlgorithm) where P
create_clusters(X::AbstractMatrix, algo::CollapsedAlgorithm,labels) where P
empty_cluster(algo::CollapsedAlgorithm) where P : an empty cluster
run!(algo::CollapsedAlgorithm{P,Q}, X, labels, clusters, cluster0; o...) where {P,Q}
Other generic functions are implemented on top of these core functions.
DPMM.DirectAlgorithm
— Type.DirectAlgorithm{P,Q} <: DPMMAlgorithm{P}
Run it by:
labels = fit(X; algorithm = DirectAlgorithm, quasi=false, ncpu=1, T=1000, keywords...)
P stands for parallel, Q stands for quasi. The quasi algorithm uses cluster population proportions as cluster weights, so it doesn't sample mixture weights from a Dirichlet distribution. For large N, this is very similar to the non-quasi sampler. The number of workers can be passed via the ncpu keyword argument to the fit or run! functions.
Provides the following methods:
DirectAlgorithm(X::AbstractMatrix{T}; modelType=_default_model(T), α=1, ninit=1, parallel=false, quasi=false, o...)
random_labels(X::AbstractMatrix, algo::DirectAlgorithm) where P
create_clusters(X::AbstractMatrix, algo::DirectAlgorithm,labels) where P
empty_cluster(algo::DirectAlgorithm) where P : an empty cluster
run!(algo::DirectAlgorithm{P,Q}, X, labels, clusters, cluster0; o...) where {P,Q}
Other generic functions are implemented on top of these core functions.
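The difference between quasi and sampled mixture weights described above can be illustrated with a minimal sketch. It is not DPMM.jl's implementation: quasi_weights and sampled_weights are hypothetical helpers, and for simplicity the Dirichlet parameters are assumed to be the cluster populations only (the weight of a new cluster is left out).

```julia
using Random

# Hypothetical helpers, not part of DPMM.jl.

# Quasi weights: cluster population proportions, no sampling involved.
quasi_weights(ns::Vector{Int}) = ns ./ sum(ns)

# Sampled weights: one draw from Dirichlet(ns). Normalized independent
# Gamma(n_k, 1) draws are Dirichlet distributed, and for an integer n_k a
# Gamma(n_k, 1) draw is the sum of n_k standard exponential draws.
function sampled_weights(rng::AbstractRNG, ns::Vector{Int})
    g = [sum(randexp(rng, n)) for n in ns]
    return g ./ sum(g)
end

ns = [600, 300, 100]                          # cluster populations, N = 1000
wq = quasi_weights(ns)                        # deterministic proportions
ws = sampled_weights(MersenneTwister(1), ns)  # random, but close to wq for large N
```

With populations this large, the sampled weights concentrate around the proportions, which is why the two samplers behave similarly.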
DPMM.SplitMergeAlgorithm
— Type.SplitMergeAlgorithm{P,Q} <: DPMMAlgorithm{P}
Run it by:
labels = fit(X; algorithm = SplitMergeAlgorithm, quasi=false, ncpu=1, T=1000, keywords...)
P stands for parallel, Q stands for quasi. The M=false algorithm doesn't do merge moves at all, so it is not exact. However, empirical results show that merge moves are very unlikely. The number of workers can be passed via the ncpu keyword argument to the fit or run! functions.
Provides the following methods:
SplitMergeAlgorithm(X::AbstractMatrix{T}; modelType=_default_model(T), α=1, ninit=1, parallel=false, quasi=false, o...)
random_labels(X::AbstractMatrix, algo::SplitMergeAlgorithm) where P
create_clusters(X::AbstractMatrix, algo::SplitMergeAlgorithm,labels) where P
empty_cluster(algo::SplitMergeAlgorithm) where P : an empty cluster
run!(algo::SplitMergeAlgorithm{P,Q}, X, labels, clusters, cluster0; o...) where {P,Q}
Other generic functions are implemented on top of these core functions.
DPMM.run!
— Function.run!(algo::DPMMAlgorithm, X, labels, clusters, emptycluster;o...)
Runs the specified Gibbs algorithm. Available algorithms are:
- CollapsedAlgorithm
- DirectAlgorithm
- SplitMergeAlgorithm
DPMM.setup_workers
— Function.setup_workers(ncpu::Integer)
Sets up parallel processes and initializes the required modules.
DPMM.initialize_clusters
— Function.initialize_clusters(X::AbstractMatrix, algo::DPMMAlgorithm{P})
Initializes clusters and labels, and sends the related data to workers if the algorithm is parallel.
Algorithms (Internal)
DPMM.random_labels
— Function.random_labels(X::AbstractMatrix, algo::DPMMAlgorithm)
Random label generator for the data. algo.ninit specifies the number of clusters.
DPMM.create_clusters
— Function.create_clusters(X,algo::CollapsedAlgorithm,labels)
Generates clusters from the labels of the data. algo.ninit specifies the number of clusters.
DPMM.empty_cluster
— Function.empty_cluster(X,algo::CollapsedAlgorithm,labels)
Generates an empty cluster (0 data points).
DPMM.RestrictedClusterProbs
— Function.RestrictedClusterProbs(πs::AbstractVector{V}, clusters::Dict, x::AbstractVector) where V<:Real
Returns the normalized probability vector of a data point belonging to each of the existing clusters.
DPMM.CRPprobs
— Function.CRPprobs(clusters::Dict, cluster0::AbstractCluster, x::AbstractVector) where V<:Real
Returns the Chinese Restaurant Process probabilities of a data point belonging to each of the existing clusters or to a new cluster.
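As an illustration of this computation, the following minimal sketch derives assignment probabilities from cluster populations and per-cluster log-likelihoods. It does not use DPMM.jl's types: crp_probs and its arguments are hypothetical, and the weighting (population for an existing cluster, α for a new one) is the standard Chinese Restaurant Process form, assumed here rather than taken from the package source.

```julia
# Hypothetical helper, not part of DPMM.jl: normalized CRP probabilities.
# populations[k] : number of points currently in cluster k
# loglikes[k]    : log-likelihood of the data point under cluster k
# loglike_new    : log-likelihood of the data point under an empty (new) cluster
# α              : concentration parameter
function crp_probs(populations::Vector{Int}, loglikes::Vector{Float64},
                   loglike_new::Float64, α::Float64)
    # log(n_k) + loglike_k for existing clusters, log(α) + loglike_new for a new one
    logp = vcat(log.(populations) .+ loglikes, log(α) + loglike_new)
    # subtract the maximum before exponentiating for numerical stability
    p = exp.(logp .- maximum(logp))
    return p ./ sum(p)
end

# Two clusters with 3 and 1 points, equal likelihoods, α = 1
probs = crp_probs([3, 1], [-1.0, -1.0], -1.0, 1.0)
```

The last entry of the result is the probability of opening a new cluster.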
DPMM.SampleSubCluster
— Function.SampleSubCluster(πs::Vector{V}, cluster::SplitMergeCluster, x::AbstractVector) where V<:Real
Returns the normalized probability vector of a data point belonging to the right or left subcluster.
DPMM.ClusterProbs
— Function.ClusterProbs(πs::AbstractVector{V}, clusters::Dict, cluster0::AbstractCluster, x::AbstractVector) where V<:Real
Returns the normalized probability vector of a data point belonging to each of the existing clusters or to a new cluster.
DPMM.place_x!
— Function.place_x!(model::AbstractDPModel,clusters::Dict,knew::Int,xi::AbstractVector)
Places a data point in its new cluster. This modifies clusters.
DPMM.label_x
— Function.label_x(clusters::Dict,knew::Int)
Returns the new cluster number for a data point.
DPMM.logmixture_πs
— Function.logmixture_πs(α::V, clusters::Dict{<:Integer, <:AbstractCluster}) where V<:Real
Samples log mixture weights from a Dirichlet distribution.
Distributions
DPMM.NormalInverseWishart
— Type.NormalInverseWishart{T<:Real,S<:AbstractPDMat} <: ContinuousUnivariateDistribution
The Normal Inverse Wishart distribution is the prior for the MvNormalFast distribution.
See MvNormalFast.
DPMM.MvNormalFast
— Type.MvNormalFast{T<:Real,Prec<:AbstractPDMat,Mean<:AbstractVector} <: AbstractMvNormal
The normal distribution is redefined for the purpose of fast likelihood calculations.
It uses the μ (mean), J (precision) parametrization.
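To show why a precision parametrization helps, here is a minimal sketch of a Gaussian log-density evaluated directly from μ and J. It is not MvNormalFast's actual code: logpdf_precision is a hypothetical helper that only uses the standard formula for the multivariate normal density.

```julia
using LinearAlgebra

# Hypothetical helper, not part of DPMM.jl:
# log-density of a Gaussian with mean μ and precision matrix J (covariance J⁻¹).
function logpdf_precision(μ::Vector{Float64}, J::Matrix{Float64}, x::Vector{Float64})
    D = length(μ)
    d = x .- μ
    # logdet(J) equals -logdet(Σ), so no matrix inverse or linear solve is needed
    return 0.5 * (logdet(J) - D * log(2π) - dot(d, J * d))
end

μ = [0.0, 0.0]
J = [1.0 0.0; 0.0 1.0]                  # identity precision: standard normal
lp = logpdf_precision(μ, J, [0.0, 0.0])
```

Because the quadratic form uses J directly, each likelihood evaluation costs one matrix-vector product.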
DPMM.DirichletFast
— Type.DirichletFast{T<:Real} <: ContinuousMultivariateDistribution
Dirichlet distribution as a prior to multinomial parameters.
The difference between DirichletFast and Dirichlet is that randn returns a MultinomialFast distribution in DirichletFast.
It also does not calculate the normalization constant at any time, so it has a faster constructor than Dirichlet.
See MultinomialFast.
DPMM.MultinomialFast
— Type.MultinomialFast{T<:Real} <: ContinuousMultivariateDistribution
The multinomial distribution is redefined for the purpose of fast likelihood calculations on DPSparseVector.
The other difference between MultinomialFast and Multinomial is that n, the number of trials, is not set. It is calculated from the input vector in the pdf function, so it can produce a pdf for any discrete x vector.
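The idea of leaving the number of trials implicit can be sketched as follows. This is not MultinomialFast's actual code: log_unnormalized_multinomial is a hypothetical helper, and storing log-probabilities is an assumption made for the example.

```julia
# Hypothetical helper, not part of DPMM.jl.
# logp : log of the category probabilities
# x    : vector of counts; the number of trials is implied by sum(x)
# Returns the log-likelihood up to the multinomial coefficient,
# which does not depend on the category probabilities.
function log_unnormalized_multinomial(logp::Vector{Float64}, x::Vector{Int})
    return sum(x .* logp)
end

logp = log.([0.5, 0.5])
a = log_unnormalized_multinomial(logp, [2, 1])   # 3 trials
b = log_unnormalized_multinomial(logp, [1, 0])   # 1 trial, same parameters
```

The same parameter vector scores count vectors with different totals, since no fixed n is stored.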
Clusters
DPMM.AbstractCluster
— Type.AbstractCluster
Abstract base class for clusters
Each subtype should provide the following methods:
- population(c) : population of the cluster
- isempty(m::AbstractCluster) : checks whether the cluster is empty
- logαpdf(c,x) : log(∝likelihood) of a data point
- lognαpdf(c,x) : log(population) + logαpdf(c,x) for a data point (used in CRP calculations)
- ClusterType(m::AbstractDPModel,X::AbstractArray) : constructor (X is the data as columns)
- ClusterType(m::AbstractDPModel,s::SufficientStats) : constructor
Other generic functions are implemented on top of these core functions.
DPMM.lognαpdf
— Function.lognαpdf(m::AbstractCluster,x::AbstractArray)
log(population) + log(∝likelihood) of a data point given by a cluster.
DPMM.logαpdf
— Function.logαpdf(m::AbstractCluster,x::AbstractArray)
log(∝likelihood) of a data point given by a cluster.
DPMM.population
— Function.population(m::AbstractCluster)
Number of data points in a cluster
DPMM.CollapsedCluster
— Type.CollapsedCluster{Pred<:Distribution, Prior<:Distribution} <: AbstractCluster
The CollapsedCluster is designed for Collapsed Gibbs algorithms.
CollapsedCluster has the following fields:
- n : population
- predictive : predictive distribution
- prior : prior distribution
A CollapsedCluster is constructed via SufficientStats or data points:
CollapsedCluster(m::AbstractDPModel, X::AbstractArray) # X is the data as columns
CollapsedCluster(m::AbstractDPModel, s::SufficientStats)
There is also a generic (not specific to CollapsedCluster) SuffStats method for getting sufficient stats for the whole data as a dictionary:
SuffStats(model::AbstractDPModel, X::AbstractMatrix, z::AbstractArray{Int})
There are also specific methods defined for creating clusters for the whole data as a dictionary:
CollapsedClusters(model::AbstractDPModel, X::AbstractMatrix, labels::AbstractArray{Int})
CollapsedClusters(model::AbstractDPModel, stats::Dict{Int,<:SufficientStats})
The - and + operations are defined for removing data from and adding data to the cluster:
-(c::CollapsedCluster, x::AbstractVector)
+(c::CollapsedCluster, x::AbstractVector)
See AbstractCluster for generic functions for all cluster types.
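The + and - operations on a cluster can be pictured as updates of sufficient statistics. The sketch below uses a hypothetical ToyStats type, not DPMM.jl's CollapsedCluster, and assumes Gaussian sufficient statistics (count, sum, and sum of outer products).

```julia
# Hypothetical type, not part of DPMM.jl: Gaussian sufficient statistics.
struct ToyStats
    n::Int                 # number of points
    s::Vector{Float64}     # sum of the points
    S::Matrix{Float64}     # sum of the outer products x * x'
end

# Adding a point increments every statistic; removing it decrements them.
Base.:+(c::ToyStats, x::AbstractVector) = ToyStats(c.n + 1, c.s + x, c.S + x * x')
Base.:-(c::ToyStats, x::AbstractVector) = ToyStats(c.n - 1, c.s - x, c.S - x * x')

c0 = ToyStats(0, zeros(2), zeros(2, 2))
x  = [1.0, 2.0]
c1 = c0 + x      # add the point
c2 = c1 - x      # remove it again
```

A Gibbs sweep can then move a point between clusters with one subtraction and one addition instead of recomputing statistics from all data.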
DPMM.DirectCluster
— Type.DirectCluster{Pred<:Distribution, Prior<:Distribution} <: AbstractCluster
The DirectCluster is designed for Direct Gibbs algorithms.
DirectCluster has the following fields:
- n : population
- sampled : sampled parameter distribution
- prior : prior distribution
A DirectCluster is constructed via SufficientStats or data points:
DirectCluster(m::AbstractDPModel,X::AbstractArray) # X is the data as columns
DirectCluster(m::AbstractDPModel,s::SufficientStats)
There is also a generic (not specific to DirectCluster) SuffStats method for getting sufficient stats for the whole data as a dictionary:
SuffStats(model::AbstractDPModel, X::AbstractMatrix, z::AbstractArray{Int})
There are also specific methods defined for creating clusters for the whole data as a dictionary:
DirectClusters(model::AbstractDPModel, X::AbstractMatrix, labels::AbstractArray{Int})
DirectClusters(model::AbstractDPModel, stats::Dict{Int,<:SufficientStats})
See AbstractCluster for generic functions for all cluster types.
DPMM.SplitMergeCluster
— Type.SplitMergeCluster{Pred<:Distribution, Post<:Distribution, Prior<:Distribution} <: AbstractCluster
The SplitMergeCluster is designed for Split-Merge Gibbs algorithm.
SplitMergeCluster has the following fields:
- n : population
- nr : right subcluster population
- nl : left subcluster population
- sampled : sampled parameter distribution
- right : right subcluster sampled parameter distribution
- left : left subcluster sampled parameter distribution
- post : posterior distributions
- rightpost : right subcluster posterior distributions
- leftpost : left subcluster posterior distributions
- llhs : log marginal likelihoods assigned by cluster, right subcluster, left subcluster
- llh_hist : right + left log marginal likelihood history over 4 iterations
- prior : prior distribution
A SplitMergeCluster is constructed via SufficientStats or data points:
SplitMergeCluster(m::AbstractDPModel,X::AbstractArray) # X is the data as columns
SplitMergeCluster(m::AbstractDPModel,s::SufficientStats)
There is also a generic SuffStats method for getting sufficient stats for the whole data:
SuffStats(model::AbstractDPModel, X::AbstractMatrix, z::AbstractVector{Tuple{Int,Bool}})
There are also specific methods defined for creating clusters for the whole data:
SplitMergeClusters(model::AbstractDPModel, X::AbstractMatrix, labels::AbstractVector{Tuple{Int,Bool}})
See AbstractCluster for generic functions for all cluster types.
The logαpdf and lognαpdf generic functions are extended for subcluster likelihoods:
logαpdf(m::SplitMergeCluster,x,::Val{false}) # right subcluster likelihood
logαpdf(m::SplitMergeCluster,x,::Val{true}) # left subcluster likelihood
lognαpdf(m::SplitMergeCluster, x, ::Val{false}) = log(population(m,Val(false))) + logαpdf(m, x, Val(false))
lognαpdf(m::SplitMergeCluster, x, ::Val{true}) = log(population(m,Val(true))) + logαpdf(m, x, Val(true))
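The Val-based dispatch used in the signatures above can be sketched with a toy type. ToyCluster is hypothetical and not part of DPMM.jl; only the convention that Val(false) selects the right subcluster and Val(true) the left one is taken from the signatures above.

```julia
# Hypothetical type, not part of DPMM.jl.
struct ToyCluster
    nr::Int   # right subcluster population
    nl::Int   # left subcluster population
end

# One method per subcluster, selected by the Bool wrapped in Val.
population(c::ToyCluster, ::Val{false}) = c.nr   # right subcluster
population(c::ToyCluster, ::Val{true})  = c.nl   # left subcluster

# A label of the form (cluster id, subcluster flag), as in Tuple{Int,Bool}
label = (1, true)
c = ToyCluster(7, 3)
n_sub = population(c, Val(label[2]))   # dispatches to the left subcluster method
```

Encoding the subcluster flag in the type lets each subcluster have its own method without an if/else branch inside the function body.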
Models
DPMM.AbstractDPModel
— Type.AbstractDPModel{T,D}
Abstract base class for DPMMs
T stands for the element type, D stands for the dimensionality of the data.
DPMM.DPGMM
— Type.DPGMM{T<:Real,D} <: AbstractDPModel{T,D}
Class for DP Gaussian Mixture Models
DPMM.DPMNMM
— Type.DPMNMM{T<:Real,D} <: AbstractDPModel{T,D}
Class for DP Multinomial Mixture Models
DPMM.DPGMMStats
— Type.DPGMMStats{T<:Real} <: SufficientStats
Sufficient statistics for Gaussian Models
DPMM.DPMNMMStats
— Type.DPMNMMStats{T<:Real} <: SufficientStats
Sufficient statistics for Multinomial Models
Data
DPMM.setup_scene
— Function.setup_scene(X)
Initialize plots for visualizing 2D data
DPMM.readNYTimes
— Function.readNYTimes(file::AbstractString)
Reads the NYTimes dataset from the given data file. It returns a DPSparseMatrix.
DPMM.GridMixture
— Function.GridMixture(L::Integer; πs::Vector{T}=ones(L*L)/(L*L)) where T<:Real
Generates an LxL grid of Gaussians.
DPMM.RandMixture
— Function.RandMixture(K::Integer;D::Int=2,πs::Vector{T}=ones(K)/K) where T<:Real
Randomly generates K Gaussians.
DPMM.DPSparseMatrix
— Type.DPSparseMatrix(X::SparseMatrixCSC{Tv,Ti}) where {Tv,Ti}
DPSparseMatrix has fast getindex methods for column indexing (i.e., X[:,i]). It also doesn't copy the column; it returns a DPSparseVector for column indexing.
See DPSparseVector.
DPMM.DPSparseVector
— Type.DPSparseVector{Tv,Ti<:Integer} <: AbstractSparseVector{Tv,Ti}
DPSparseVector is almost the same as SparseArrays.SparseVector. The only difference is that summation between DPSparseVectors results in a Vector.
Function Index
DPMM.AbstractCluster
DPMM.AbstractDPModel
DPMM.CollapsedAlgorithm
DPMM.CollapsedCluster
DPMM.DPGMM
DPMM.DPGMMStats
DPMM.DPMMAlgorithm
DPMM.DPMNMM
DPMM.DPMNMMStats
DPMM.DPSparseMatrix
DPMM.DPSparseVector
DPMM.DirectAlgorithm
DPMM.DirectCluster
DPMM.DirichletFast
DPMM.MultinomialFast
DPMM.MvNormalFast
DPMM.NormalInverseWishart
DPMM.SplitMergeAlgorithm
DPMM.SplitMergeCluster
DPMM.CRPprobs
DPMM.ClusterProbs
DPMM.GridMixture
DPMM.RandMixture
DPMM.RestrictedClusterProbs
DPMM.SampleSubCluster
DPMM.create_clusters
DPMM.empty_cluster
DPMM.fit
DPMM.initialize_clusters
DPMM.label_x
DPMM.logmixture_πs
DPMM.lognαpdf
DPMM.logαpdf
DPMM.place_x!
DPMM.population
DPMM.random_labels
DPMM.readNYTimes
DPMM.run!
DPMM.setup_scene
DPMM.setup_workers