Introduction to Artificial Intelligence and Machine Learning 2 (8.1~8.3) Summary

Dobby-HJ · 2023. 10. 14. 01:10

8.1 ~ 8.2 K-Means Algorithm

Unsupervised Learning

  • You don’t know the true value, and you cannot provide examples of the true value.
  • Cases, such as
    • Discovering clusters
    • Discovering latent factors
    • Discovering graph structures
  • Clustering or filtering or completing of
    • Finding the representative topic words from text data
    • Finding the latent image from facial data
    • Completing the incomplete matrix of product-review scores
    • Filtering the noise from the trajectory data
  • Methodologies
    • Clustering: estimating sets and affiliations of instances to the sets
    • Filtering: estimating underlying and fundamental signals from the mixture of signals and noises

Clustering Problem

  • How to cluster the unlabeled data points
    • No concrete knowledge of their classes
    • Latent (hidden) variables of classes
    • Optimal assignment to the latent classes

K-Means Algorithm

  • K-Means algorithm
    • Set up $K$ centroids (or prototypes) and cluster data points by the distance from each point to its nearest centroid
  • In contrast, $K$-Nearest Neighbors checks the labels of the $K$ nearest elements and takes the most frequent label among them as its prediction → hence it is supervised learning
  • Formally,
    • $J =\sum_{n=1}^N\sum_{k=1}^K r_{nk}||x_n-\mu_k||^2$ → here $\mu_k$ is the location of the $k^{th}$ centroid
    • Minimize $J$ by optimizing
      • $r_{nk}$ : the assignment of data points to clusters
        • The goal of this task is to assign 0 or 1 to these values so that $J$ above is minimized (a small snippet evaluating $J$ follows this list)
        • In other words, these can also be called parameters or decision variables
      • $\mu_{k}$ : the location of the centroids
    • Iterative optimization
      • Why?
      • The two variables $r_{nk}$ and $\mu_k$ interact, so each is optimized in turn while the other is held fixed

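To make the two decision variables concrete, here is a minimal NumPy sketch (the toy data, centroids, and assignments are made up for illustration) that evaluates the objective $J$ for a given assignment matrix $r_{nk}$ and centroid matrix $\mu_k$:

```python
import numpy as np

# Toy data: N = 4 points in 2-D and K = 2 centroids (values made up for illustration).
X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [2.9, 3.0]])  # shape (N, 2)
mu = np.array([[0.0, 0.1], [3.0, 3.0]])                         # shape (K, 2)

# r[n, k] = 1 if point n is assigned to cluster k, else 0 (hard assignment).
r = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

# J = sum_n sum_k r_nk * ||x_n - mu_k||^2
sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
J = (r * sq_dists).sum()
print(J)  # a smaller J means tighter clusters
```
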
Expectation and Maximization

The combination of Expectation and Maximization is called the EM algorithm.

It is a technique that optimizes by alternating between an Expectation step and a Maximization step.

Let's walk through the $K$-Means clustering procedure as an example (a code sketch of the full loop follows the derivation below).

  1. Initial setup: set the centroids $\mu_k$ randomly.
  2. Update $r_{nk}$ to match the currently fixed $\mu_k$.
  3. Update $\mu_k$ using the newly obtained $r_{nk}$.
  4. Repeat steps 2~3.
  • $J = \sum_{n=1}^N \sum_{k=1}^K r_{nk}||x_n - \mu_k||^2$
    • Expectation
      • Expectation of the log-likelihood given the parameters
      • Assign the data points to the nearest centroid
    • Maximization
      • Maximization of the parameters (here the parameters include not only $r_{nk}$ but also the centroid locations) with respect to the log-likelihood
      • Update the centroid positions given the assignments
  • $r_{nk}$
    • $r_{nk} \in \{0, 1\}$
    • Discrete variable
    • Logical choice: assign each data point $x_n$ to its nearest centroid $\mu_k$
  • $\mu_k$
    • $\cfrac{\partial J}{\partial \mu_k} = \cfrac{\partial}{\partial \mu_k} \sum_{n=1}^N \sum_{k=1}^K r_{nk}||x_n - \mu_k||^2$ → the $\sum_{k=1}^K$ terms that do not involve this particular $\mu_k$ drop out of the derivative, leaving
      $=\cfrac{\partial}{\partial \mu_k} \sum_{n=1}^N r_{nk}||x_n - \mu_k||^2 = \sum_{n=1}^N - 2r_{nk}(x_n - \mu_k) = -2 \left(-\sum_{n=1}^N r_{nk}\mu_k + \sum_{n=1}^N r_{nk}x_n\right) = 0$
    • $\mu_k = \cfrac{\sum_{n=1}^N r_{nk}x_n}{\sum_{n=1}^N r_{nk}}$ → the numerator is the sum of the positions of the data points assigned to cluster $k$, and the denominator is the number of data points assigned to cluster $k$, so $\mu_k$ is simply the mean of those positions.
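
Putting the E and M steps together, below is a minimal $K$-Means sketch in NumPy; the random initialization from data points and the fixed number of iterations are my own choices for illustration, not prescribed by the lecture. The E-step assigns each point to its nearest centroid, and the M-step applies the update $\mu_k = \sum_n r_{nk} x_n / \sum_n r_{nk}$ derived above.

```python
import numpy as np

def kmeans(X, K, n_iters=10, seed=0):
    """Hard-assignment K-Means: alternate the E-step (assignments) and M-step (centroids)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)  # random initial centroids

    for _ in range(n_iters):
        # E-step: r_nk <- 1 for the nearest centroid (stored here as one index per point).
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        assign = sq_dists.argmin(axis=1)

        # M-step: mu_k <- mean of the points currently assigned to cluster k.
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                mu[k] = members.mean(axis=0)

    return mu, assign

# Example usage on toy 2-D data with two well-separated blobs:
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centroids, labels = kmeans(X, K=2)
```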

Progress of $K$-Means Algorithm

  • EM iterations to
    • Optimize the assignments with respect to the sum of distances
    • Optimize the parameters with respect to the sum of distances

  • As the iterations proceed, you can see the centroids (red, green, and blue) moving so that the average distance from the points to their assigned centroid decreases.

Properties of $K$-Means Algorithm

  • Number of clusters is uncertain

  • Initial location of centroids
    • Some initial locations might not lead to reasonable results
  • Limitation of distance metrics
    • Euclidean distance captures only very limited information
  • Hard clustering
    • Hard assignment of data points to clusters
      • $r_{nk} \in \{0, 1\}$
        • This could instead be a smoothly distributed probability
      • Any alternatives?
      • Soft clustering

8.3 Multinomial Distribution

Multinomial Distribution

  • Binary variable
    • Selecting 0 or 1 → binomial distribution
  • How about $K$ options?
    • $X = (0, 0, 1, 0, 0, 0)$ when $K = 6$ and selecting the third option
    • $\sum_k x_k = 1, P(X|\mu) = \Pi_{k=1}^K \mu_k^{x_k}$ such that $\mu_k \ge 0, \sum_k\mu_k = 1$
    • A generalization of binomial distribution → Multinomial distribution
  • Given a dataset $D$ with $N$ selections, $x_1,\dots, x_N$
    → This is the situation where all $N$ selections are observed together; since we consider the events jointly, the probability is a joint probability, which is why the $\Pi_{n=1}^N$ notation is attached in front of the formula above (a short numerical check of the resulting factorization follows this list).
    • $P(X|\mu) = \Pi_{n=1}^N\Pi_{k=1}^K \mu_k^{x_{nk}} = \Pi_{k=1}^K\mu_k^{\sum_{n=1}^N x_{nk}} = \Pi_{k=1}^K \mu_k^{m_k}$
      • where $m_k = \sum_{n=1}^N x_{nk}$
      • the number of times the $k^{th}$ option is selected out of $N$ selections
    • How to determine the maximum likelihood solution of $\mu$? → MLE
      • Maximize $P(X|\mu) = \Pi_{k=1}^K \mu_k^{m_k}$
      • Subject to $\mu_k \ge 0, \sum_k \mu_k = 1$
      • We want to maximize $P(X|\mu)$, but since we have the constraints $\mu_k \ge 0, \sum_k \mu_k = 1$, we need to take these constraints into account while maximizing.
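
As a quick numerical check of the factorization above (the one-hot matrix and candidate $\mu$ below are made up for illustration), the per-datapoint product $\Pi_n\Pi_k \mu_k^{x_{nk}}$ and the count-based product $\Pi_k \mu_k^{m_k}$ give the same value:

```python
import numpy as np

# Made-up one-hot selections x_nk: N = 5 selections over K = 3 options, plus a candidate mu.
X = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]])
mu = np.array([0.2, 0.3, 0.5])

p_per_point = np.prod(mu ** X)       # prod_n prod_k mu_k^{x_nk}
m = X.sum(axis=0)                    # m_k = sum_n x_nk (counts per option)
p_counts = np.prod(mu ** m)          # prod_k mu_k^{m_k}

print(np.isclose(p_per_point, p_counts))  # True: the two forms agree
```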

Lagrange Method

  • Method of finding a local maximum subject to constraints
    • Maximize $f(x,y)$
    • Subject to $g(x,y) = c$
    • Assuming that $f$ and $g$ have continuous partial derivatives
      1. Lagrange function and multiplier (do you recall this?)
        1. $L(x, y, \lambda) = f(x,y) + \lambda(g(x,y) - c)$
        2. $L(\mu, m, \lambda) = \sum_{k=1}^Km_k\ln\mu_k + \lambda(\sum_{k=1}^K \mu_k - 1)$
          • Using the log-likelihood
      2. Take the partial first-order derivative with respect to each variable, and set it to zero
        • $\frac{\partial}{\partial\mu_k}L(\mu, m, \lambda) = \frac{m_k}{\mu_k} + \lambda = 0$ → $\mu_k = -\frac{m_k}{\lambda}$
      3. Utilize the constraint to get the optimized value
        • $\sum_k \mu_k = 1$ → $\sum_k - \frac{m_k}{\lambda} = 1$ → $\sum_km_k = -\lambda$ → $\sum_k\sum_{n=1}^N x_{nk} = -\lambda$
          → $N = -\lambda$
        • $\mu_k = \frac{m_k}{N}$ : MLE parameter of multinomial distribution
          • That is, the optimal $\mu_k$ equals (the number of times the $k^{th}$ category was selected) / (the total number of selections); choosing $\mu_k$ this way maximizes the likelihood (a quick numerical check follows below).
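
As a minimal numerical check of this closed-form result (the one-hot data below is made up), computing $m_k = \sum_n x_{nk}$ and setting $\mu_k = m_k / N$ gives a log-likelihood that does not improve when probability mass is shifted between options:

```python
import numpy as np

# Made-up one-hot selections x_nk: N = 6 selections over K = 3 options.
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 1],
              [0, 0, 1]])
N = X.shape[0]

m = X.sum(axis=0)        # m_k: how many times option k was selected
mu_mle = m / N           # closed-form MLE: mu_k = m_k / N
print(mu_mle)            # [1/6, 2/6, 3/6]

def log_likelihood(mu):
    return (m * np.log(mu)).sum()   # sum_k m_k * ln(mu_k)

# Shift a little probability mass between options 0 and 1: the likelihood should not improve.
eps = 0.05
mu_perturbed = mu_mle + np.array([eps, -eps, 0.0])
print(log_likelihood(mu_mle) >= log_likelihood(mu_perturbed))  # True
```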