Use in Dirichlet mixture models



Figure: Simulation of 1000 observations drawn from a Dirichlet process mixture model. Each observation within a cluster is drawn independently from the multivariate normal distribution $N(\mu_k, 1/4)$. The cluster means $\mu_k$ are drawn from a distribution $G$ which itself is drawn from a Dirichlet process with concentration parameter $\alpha = 0.5$ and base distribution $H = N(2, 16)$. Each row is a new simulation.
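
A draw like each row of this figure can be reproduced with a short simulation. The following is a minimal sketch (treating the observations as one-dimensional) that uses a truncated stick-breaking construction of the Dirichlet process, one standard way to realize a draw of $G$ (sketched in more detail later in this section); the truncation level, seed and function name are illustrative choices, not part of the original figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_dp_mixture(n=1000, alpha=0.5, trunc=100):
    """One row of the figure: n draws from a DP mixture with H = N(2, 16)."""
    # Stick-breaking: beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{j<k}(1 - beta_j).
    betas = rng.beta(1.0, alpha, size=trunc)
    pi = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    pi /= pi.sum()                         # renormalize the truncated weights
    mu = rng.normal(2.0, 4.0, size=trunc)  # cluster means mu_k ~ H = N(2, 16), sd 4
    z = rng.choice(trunc, size=n, p=pi)    # pick an atom of G for each observation
    return rng.normal(mu[z], 0.5)          # v_i ~ N(mu_k, 1/4), i.e. sd 1/2

row = simulate_dp_mixture()  # repeat for each row of the figure
```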


To understand what Dirichlet processes are and the problem they solve, consider the example of data clustering. A common situation is that data points are assumed to be distributed in a hierarchical fashion where each data point belongs to a (randomly chosen) cluster and the members of a cluster are further distributed randomly within that cluster.


Example 1

For example, we might be interested in how people will vote on a number of questions in an upcoming election. A reasonable model for this situation might be to classify each voter as a liberal, a conservative or a moderate, and then model the event that a voter says "yes" to any particular question as a Bernoulli random variable with probability dependent on which political cluster they belong to. By looking at how votes were cast in previous years on similar pieces of legislation one could fit a predictive model using a simple clustering algorithm such as k-means. That algorithm, however, requires knowing in advance the number of clusters that generated the data. In many situations it is not possible to determine this ahead of time, and even when we can reasonably assume a number of clusters we would still like to be able to check this assumption. For example, in the voting example above the division into liberal, conservative and moderate might not be finely tuned enough; attributes such as religion, class or race could also be critical for modeling voter behaviour.


Example 2

As another example, we might be interested in modeling the velocities of galaxies using a simple model assuming that the velocities are clustered, for instance by assuming that each velocity is distributed according to the normal distribution $v_i \sim N(\mu_k, \sigma^2)$, where the $i$th observation belongs to the $k$th cluster of galaxies with common expected velocity. In that case it is far from obvious how to determine a priori how many clusters (of common velocities) there should be, and any model for this would be highly suspect and should be checked against the data. By using a Dirichlet process prior for the distribution of cluster means we circumvent the need to explicitly specify ahead of time how many clusters there are, although the concentration parameter still controls it implicitly.


We consider this example in more detail. A first naive model is to presuppose that there are $K$ clusters of normally distributed velocities with common known fixed variance $\sigma^2$. Denoting the event that the $i$th observation is in the $k$th cluster as $z_i = k$, we can write this model as:











$$
\begin{aligned}
(v_i \mid z_i = k, \mu_k) &\sim N(\mu_k, \sigma^2) \\
\mathrm{P}(z_i = k) &= \pi_k \\
(\boldsymbol{\pi} \mid \alpha) &\sim \mathrm{Dir}\left(\frac{\alpha}{K} \cdot \mathbf{1}_K\right) \\
\mu_k &\sim H(\lambda)
\end{aligned}
$$



That is, we assume that the data belongs to $K$ distinct clusters with means $\mu_k$ and that $\pi_k$ is the (unknown) prior probability of a data point belonging to the $k$th cluster. We assume that we have no initial information distinguishing the clusters, which is captured by the symmetric prior $\mathrm{Dir}(\alpha/K \cdot \mathbf{1}_K)$. Here $\mathrm{Dir}$ denotes the Dirichlet distribution and $\mathbf{1}_K$ denotes a vector of length $K$ in which each element is 1. We further assign independent and identical prior distributions $H(\lambda)$ to each of the cluster means, where $H$ may be any parametric distribution with parameters denoted as $\lambda$. The hyper-parameters $\alpha$ and $\lambda$ are taken to be known fixed constants, chosen to reflect our prior beliefs about the system. To understand the connection to Dirichlet process priors we rewrite this model in an equivalent but more suggestive form:











$$
\begin{aligned}
(v_i \mid \tilde{\mu}_i) &\sim N(\tilde{\mu}_i, \sigma^2) \\
\tilde{\mu}_i &\sim G = \sum_{k=1}^{K} \pi_k \delta_{\mu_k}(\tilde{\mu}_i) \\
(\boldsymbol{\pi} \mid \alpha) &\sim \mathrm{Dir}\left(\frac{\alpha}{K} \cdot \mathbf{1}_K\right) \\
\mu_k &\sim H(\lambda)
\end{aligned}
$$



Instead of imagining that each data point is first assigned a cluster and then drawn from the distribution associated to that cluster, we now think of each observation as being associated with a parameter $\tilde{\mu}_i$ drawn from some discrete distribution $G$ with support on the $K$ means. That is, we are now treating $\tilde{\mu}_i$ as being drawn from the random distribution $G$, and our prior information is incorporated into the model by the distribution over distributions $G$.
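
To make this reformulation concrete, here is a minimal sketch of sampling the finite-$K$ model in this random-measure form; the particular values of $K$, $n$, $\alpha$, $\sigma$ and the choice $H = N(0, 1)$ are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, alpha, sigma = 5, 200, 1.0, 0.5      # illustrative values

pi = rng.dirichlet(np.full(K, alpha / K))  # (pi | alpha) ~ Dir(alpha/K * 1_K)
mu = rng.normal(0.0, 1.0, size=K)          # mu_k ~ H(lambda), here H = N(0, 1)

# G = sum_k pi_k * delta_{mu_k} is a discrete measure, so drawing
# mu_tilde_i ~ G just means picking atom mu_k with probability pi_k.
mu_tilde = mu[rng.choice(K, size=n, p=pi)]
v = rng.normal(mu_tilde, sigma)            # (v_i | mu_tilde_i) ~ N(mu_tilde_i, sigma^2)
```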




Figure: Animation of the clustering process for one-dimensional data using Gaussian distributions drawn from a Dirichlet process. The histograms of the clusters are shown in different colours. During the parameter estimation process, new clusters are created and grow on the data. The legend shows the cluster colours and the number of datapoints assigned to each cluster.


We would now like to extend this model to work without pre-specifying a fixed number of clusters $K$. Mathematically, this means we would like to select a random prior distribution $G(\tilde{\mu}_i) = \sum_{k=1}^{\infty} \pi_k \delta_{\mu_k}(\tilde{\mu}_i)$ where the values of the cluster means $\mu_k$ are again independently distributed according to $H(\lambda)$ and the distribution over $\pi_k$ is symmetric over the infinite set of clusters. This is exactly what is accomplished by the model:











$$
\begin{aligned}
(v_i \mid \tilde{\mu}_i) &\sim N(\tilde{\mu}_i, \sigma^2) \\
\tilde{\mu}_i &\sim G \\
G &\sim \mathrm{DP}(H(\lambda), \alpha)
\end{aligned}
$$
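
Here $\mathrm{DP}(H(\lambda), \alpha)$ denotes a Dirichlet process with base distribution $H(\lambda)$ and concentration parameter $\alpha$. One standard explicit construction of such a random $G$, stated here without derivation, is the stick-breaking representation:

$$
\beta_k \sim \mathrm{Beta}(1, \alpha), \qquad \pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j), \qquad \mu_k \sim H(\lambda), \qquad G = \sum_{k=1}^{\infty} \pi_k \delta_{\mu_k}.
$$

Truncating this sum at a finite number of atoms is what underlies the truncation-based sampling mentioned below.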



With this in hand we can better understand the computational merits of the Dirichlet process. Suppose that we wanted to draw $n$ observations from the naive model with exactly $K$ clusters. A simple algorithm for doing this would be to draw $K$ values of $\mu_k$ from $H(\lambda)$, a distribution $\boldsymbol{\pi}$ from $\mathrm{Dir}(\alpha/K \cdot \mathbf{1}_K)$, and then for each observation independently sample the cluster $k$ with probability $\pi_k$ and the value of the observation according to $N(\mu_k, \sigma^2)$.
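
This recipe translates directly into code. A sketch, mirroring the one given earlier but now recording the cluster labels $z_i$ explicitly (again with illustrative parameter values):

```python
import numpy as np

rng = np.random.default_rng(2)
K, n, alpha, sigma = 5, 1000, 1.0, 0.5    # illustrative values

mu = rng.normal(0.0, 1.0, size=K)         # draw K values mu_k ~ H(lambda)
pi = rng.dirichlet(np.full(K, alpha / K)) # draw pi ~ Dir(alpha/K * 1_K)
z = rng.choice(K, size=n, p=pi)           # sample each observation's cluster
v = rng.normal(mu[z], sigma)              # v_i ~ N(mu_{z_i}, sigma^2)
```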

It is easy to see that this algorithm does not work in the case where we allow infinitely many clusters, because this would require sampling the infinite-dimensional parameter $\boldsymbol{\pi}$. However, it is still possible to sample observations $v_i$: one can, for example, use the Chinese restaurant representation described below and calculate the probability for each used cluster and for a new cluster to be created. This avoids having to explicitly specify $\boldsymbol{\pi}$ (see the sketch below). Other solutions are based on a truncation of clusters: a (high) upper bound on the true number of clusters is introduced, and cluster numbers higher than the bound are treated as one cluster.
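
A minimal sketch of the Chinese-restaurant approach follows; $\alpha$, $\sigma$ and the base $H = N(2, 16)$ are assumed values, and the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, sigma = 0.5, 0.5  # assumed concentration and within-cluster sd

def sample_crp_mixture(n):
    """Draw n observations v_i without ever instantiating pi."""
    mus, counts, v = [], [], []
    for i in range(n):
        # Existing cluster k is chosen with probability counts[k] / (i + alpha),
        # a brand-new cluster with probability alpha / (i + alpha).
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(mus):                     # open a new cluster
            mus.append(rng.normal(2.0, 4.0))  # its mean mu_k ~ H = N(2, 16)
            counts.append(0)
        counts[k] += 1
        v.append(rng.normal(mus[k], sigma))   # v_i ~ N(mu_k, sigma^2)
    return np.array(v)

observations = sample_crp_mixture(1000)
```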


Fitting the model described above based on observed data $D$ means finding the posterior distribution $p(\boldsymbol{\pi}, \boldsymbol{\mu} \mid D)$ over cluster probabilities and their associated means. In the infinite-dimensional case it is impossible to write down the posterior explicitly. It is, however, possible to draw samples from the posterior using a modified Gibbs sampler. This is the critical fact that makes the Dirichlet process prior useful for inference.
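
The sampler itself is not developed here, but the flavour of one common scheme, a collapsed Gibbs sampler for the conjugate case in the spirit of Neal's algorithms for Dirichlet process mixture models, can be sketched: with $G$ integrated out, each assignment $z_i$ is resampled in turn from

$$
p(z_i = k \mid z_{-i}, v, \alpha, \lambda) \propto
\begin{cases}
n_{-i,k} \, p\left(v_i \mid \{v_j : z_j = k,\ j \neq i\}, \lambda\right) & \text{for an existing cluster } k, \\
\alpha \int p(v_i \mid \mu) \, \mathrm{d}H(\mu; \lambda) & \text{for a new cluster},
\end{cases}
$$

where $n_{-i,k}$ is the number of observations other than $v_i$ currently assigned to cluster $k$. Both cases are tractable in closed form when $H$ is conjugate to the normal likelihood.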







