2011年3月22日 星期二

[aMMAI] Paper Summary: Paper Summary: Latent Dirichlet allocation

Title: Latent Dirichlet allocation
Author: D. Blei, A. Ng, and M. Jordan
Publication: Journal of Machine Learning Research, 3:993–1022, January 2003








LDA
Latent Dirichlet allocation的縮寫。是一種unsupervised learning,可以用來識別大規模文件集中潛藏的主題訊息,用了Bag-of-word的概念,不考慮字出現的次序,減化問題的複雜度。
LDAtopic model是一個三層的Bayes Hierarchy Model,包含topicdocword三層。將模型的參數看作隨機變量,可以引入控制參數的參數。

Topic model一般來說有兩種分部:
第一種: topic~word的分佈,p(w|z),是多項式分佈
第二種: doc~topic的分佈,p(z|d) ,是Dirichlet分佈
doc 對應到多個topictopic又對應到多個word

降維就是將文件投影到topic的空間中。

LDA是一個機率生成模型,可以隨機生成一篇由多個主題組成的文件。
定義一些參數:










V: word的大小,表示corpus中有多少字。
w:表示一份文件,N表示文件長度。
D:表示corpus,大小為M,代表有M份文件。

具體生成文件的步驟如下:







1.選擇NN是一篇文件的長度,也就是word的數量,
2.選擇θθ1*K的列向量,是Dirichlet分布代表每個topic發生的機率,alphaDirichlet分布的參數。θ1k都是非負值,且總合為1
3.對每一個word, wn
選一個topic zn,是Multinomial(θ)多項分布。
        選一個word wn,根據p(wn| zn, beta),在zn條件下的多項分佈

這邊的beta,是一個K*V的矩陣,betaij = P(wj=1|zi=1)beta記錄某個topic下生成某個word的機率。


簡單來說,對每一篇文件,從topic分佈中抽出一個主題,再從該主題中,所對應的word分佈抽出一個word,重複上面的步驟,直到Nword都找到。

步驟二,是pLSALDA最大的不同點,LDAdoc~topic中間引入Dirichlet modelθ代表每個topic發生的機率,每一個文章會對應到一個θ。這樣就不會像pLSA一樣,model參數隨著文件的增加而增加。

下圖是LDAGraphical model










塗滿的圓圈表示觀察變量,空心圓表示隱含變量,方框代表重覆的次數。

寫成數學式子如下:







這邊由於引入了alphabeta兩個參數。

所以要做參數的估計,利用EM去求,但是過程中事後機率,p(θ,z|w)無法計算出解析式,需要近似解,尋找一個likelihood函式的下界。
Paper中利用了variational inference來求下界,找出alphabeta,就可以丟到EM裡去計算,直到收斂。


網路上有LDAsource code,可以去找來實驗看看。

參考網站:



2011年3月20日 星期日

[aMMAI] Paper Summary: Probabilistic latent semantic analysis

Title: Probabilistic latent semantic analysis
Author: T. Hofmann
Publication: SIGIR, 1999



LSA:
Latent Semantic Analysis的縮寫,用來分析文件中潛在的語意。
基本的作法是將高維的向量空間(根據term frequency)投影到潛在語意空間中,達到降維的效果。
具體的作法是用奇異值分析singular value Decomposition(SVD)
先定義一些參數:
z屬於ZZ={z_1,z_2,…,z_K},有K種語意。
w屬於WW={w_1,w_2,…,w_M},有M種字。
d屬於DD={d_1,d_2,…,d_N},有N份文件。









式子(1),是將維度N*W的矩陣NSVD
式子(2)NV兩個都是正交矩陣(orthogonal matrices)
式子(3),取sigmaK個最大的特徵值,其它設為零,形成新的sigma
式子(4),利用新的sigma產生一個新的N,近似原本的N
式子(5),利用內積來計算文件與文件間的相似度。
式子(6),文件在Latent Space上的座標。


pLSA:
Probabilistic Latent Semantic的縮寫。
LSA上加入統計機率模型的概念,解決了同義詞與一字多意的問題。
也是先定義一些參數:
P(d): 選到文件d的機率。
P(z|d): 文件d選到潛在語意z的機率。
P(w|z): 語意z產生詞w的機率。

觀察到一對(d,w),但是z沒有被加進來,因此利用joint probability model





利用貝氏定理做轉換





根據likelihood原理,經由最大化log-likelihood function決定P(d)P(z|d)P(w|z)





利用EM求解
E-step:





M-step:










因為pLSA有時會出現overfit的問題。
overfit: 一個假設在training data上可以獲得比其它假設更好的fit,但是在training data以外的data set上卻無法有很好的fit
pLSA可以產生所在dataset的文件模型,可是卻無法產生新文件的模型。
因此修改EM,引入了一個新參數beta,稱為tempered EM (TEM)
公式如下:







Beta的起始值為1,接著逐漸減少。
然後根據待訓練數據來測試模型如果成功則使用該beta如果不成功則收斂收斂的意思就是使得beta = n*betan<1

pLSALSA的關係對照











其中p(z)表示Latent Space的信息,也就是主題空間;
p(w|z)表示主題空間與詞空間之間的關係,對應LSA中的V
文件分類時,上述兩部分是在訓練時要保留的資訊,當新文件進入時, 利用EM算法得到新的文件與主题的對應關係p(d|z),並由此得到文件在主題空間上的表示p(z|d)


pLSA的優點:
1.定義的機率模型。
2.LSA隱含了高斯分部,pLSA隱含了Multi-nomial更符合文件特性。
3.pLSA目標是KL-divergence最小,與orthogonal projection不同。
4.解決了多意詞的問題



2011年3月16日 星期三

[aMMAI] Paper Summary: Lost in Binarization : Query-Adaptive Ranking for Similar Image Search with Compact Codes

Title: Lost in Binarization : Query-Adaptive Ranking for Similar Image Search with Compact Codes
Author: Yu-Gang Jiang, Jun Wang, Shih-Fu Chang
Publication:  ICMR, 2011






Hamming distance lacks in providing good ranking that is crucial for image search – there can be different binary codes sharing equal distance.

This paper proposes a novel approach to compute query-adaptive weights for each bit of the binary codes, which has two main advantages.
First, with the bit-level weights, they are able to rank the returned images at a finer-grained binary code level, rather than at the traditional Hamming distance level.
Second, contrary to using a single set of weights for all the queries, our approach tailors a different and more suitable set of weights for each query.

In this paper, choose the popular bag-of-visual-words (BoW) framework grounded on the local SIFT features.

Three data structure:
1) inverted index: This largely limits the application of inverted files for large scale image search.
2) tree-based index: not suitable for high-dimensional feature
3) binary embedding: SSH&DBN (this paper choose to use)

Learning Class-Specific Weights
Intra-class:
To learn k weight vectors a1,...,ak, where ai corresponds to class i.
ai: nonnegative & sum of all a is 1






Sum the distance from every node in class to the class center ci

Inter-class:
This maintains the class relationship, which is important since the semantic classes under our consideration are not exclusive – in fact some of them are highly correlated (e.g., tree and grass).





Add sij to measure the distance from difference class.

Base on above, the following optimization problem to learn the weights for each class:









Where > 0 is a parameter that controls the balance of the two terms.
Optimization problem can be efficiently solved using an iterative quadratic programming (QP) scheme.

Query-Adaptive Weight:
First using Hamming distance retrieval result to generate query adaptive weights
The query adaptive weights aq are computed vi linear combination




T: most 3 relevant semantic classes to query q
Mi: as the number of images from class T (random selection 500 image)

2011年3月9日 星期三

[aMMAI] Paper Summary: Aggregating local descriptors into a compact image representation

Title: Aggregating local descriptors into a compact image representation
Author: Herve Jegou, Matthijs Douze, Cordelia Schmid, Patrick Perez
Publication: IEEE CVPR'10



Image search on large scale should consider three constraints: the search accuracy, its efficiency and the memory usage.
This is obtained by optimizing:
  1. the representation, i.e., how to aggregate local image descriptors into a vector representation;
  2. the dimensionality reduction of these vectors;
  3. the indexing algorithm.


This paper first contribution consists in proposing a representation that provides excellent search accuracy with a reasonable vector dimensionality, as we know that the vector will be indexed subsequently. They propose a descriptor, derived from both BOF and Fisher kernel that aggregates SIFT descriptors and produces a compact representation. It is termed VLAD (vector of locally aggregated descriptors).

  • VLAD:
    like BOF, the idea of the VLAD descriptor is to accumulate, for each visual word ci, the differences xci of the vectors x assigned to ci.
    This characterizes the distribution of the vectors with respect to the center.(ci : visual word i, x: the descriptor vectors)

  • Coding vector:
    1. a projection that reduces the dimensionality of the vector.
    Method: Approximate nearest neighbors, then using the asymmetric distance computation (ADC) variant of this approach.
    2. quantization used to index the resulting vectors.
    Method: PCA


Second contribution, they show the advantage of jointly optimizing the trade-off between the dimensionality reduction and the indexation algorithm.
Optimizing the dimension D ( the D dimensional VLAD vector reduce by PCA). Empirically measured the mean square error, the optimization selects D=64.

2011年3月1日 星期二

[aMMAI] Paper Summary: Efficient visual search of videos cast as text retrieval

Title: Efficient visual search of videos cast as text retrieval
Author: Josef Sivic and Andrew Zisserman
Publication: IEEE TPAMI, 2009



The paper describes an approach to object retrieval which searches for and localizes all the occurrences of an object in a video, given a query image of the object.
Employing methods form statistical text retrieval to efficient retrieval.
The methods including inverted file system, and text and document frequency weighting.

Object retrieval using visual words can be divide two parts off-line and run-time
1. Pre-processing (off-line)

  • Detect affine covariant regions in each keyframe of the video. (Shape Adapted & Maximally Stable)
    Represent each region by a SIFT descriptor.
  • Track the regions through the video and reject unstable regions.
  • Build a visual vocabulary by clustering stable regions from a subset of the video. Assign each region descriptor in each keyframe to the nearest cluster centre.
  • Remove stop-listed visual words
  • Compute tf–idf weighted document frequency vectors
  • Build the inverted file indexing structure
2. At run-time (given a user selected query region)
  • Determine the set of visual words within the query region.
  • Retrieve keyframes based on visual word frequencies
  • Re-rank the top Ns(= 500) retrieved keyframes using spatial consistency


Then they compare the time complexity of retrieval architecture to no vector quantization.
The time complexity with vector quantization: O(MN).No vector quantization: O(NR^2D).
M: each frame contain M distinct visual words
R: each frame contain R region.
N: frames
D: SIFT feature dimension

The method in this paper allows retrieval for a particular visual aspect of an object. However, temporal information within a shot may be used to group visual aspects, and enable object level retrieval.

Some difference between document retrieval by bag-of-word and frame retrieval by bag-of-visual word:
  1. bag-of-word discards spatial information
  2. A proportion of visual word expects to match between query and target not all.
  3. Internet search exploit cues such as the link structure of the web and web-page




***** Personal Notes ******

Visual Analogy:
Analogy是指不相似的東西在某些部分有相似的地方。
Visual Analogy也可以用上面的方式來解釋。
這個網站有用圖解說明什麼是Visual Analogy

[aMMAI] Paper Summary: Distinctive Image Features from Scale-Invariant Keypoints

Title: Distinctive Image Features from Scale-Invariant Keypoints
Author: David G. Lowe
Publication: IJCV, 2004


  
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene.

The features are invariant to image scaling and rotation, and partially invariant to change in illumination and 3D camera viewpoint. They are well localized in both the spatial and frequency domains, reducing the probability of disruption by occlusion, clutter, or noise. In addition, the features are highly distinctive, which allows a single feature to be correctly matched with high probability against a large database of features, providing a basis for object and scene recognition.

The approach of Scale Invariant Feature Transform (SIFT) is as follows
  1. Scale-space extrema detection: Using a DoG function to identify potential interest points
  2. Keypoint localization: Determine the location and scale of the candidates.
  3. Orientation assignment: One or more orientations are assigned to each keypoint location based on local image gradient directions.
  4. Keypoint descriptor: Measure image gradients. Orientation histogram.


This paper has also presented methods for using the keypoints for object recognition such as using approximate nearest-neighbor lookup, a Hough transform for identifying clusters that agree on object pose, least-squares pose determination, and final verification. Other potential applications include view matching for 3D reconstruction, motion tracking and segmentation, robot localization, image panorama assembly, epipolar calibration, and any others that require identification of matching locations between images.



**** 寫來自己看的筆記 ****

SIFT目的是要改善Harris corner detector不是scale-invariant這個問題。
SIFT的四步驟:
1.1利用DoG建立scale space
2、跟圖3一起看的話,會比較清楚(因為我圖2一直看不懂)
3最上面三張照片,相當於圖2左下Scale(first octave)的部分。
3最上面三張照片表示不同scale的圖,最左邊是原圖,中間跟右邊利用Gaussian filter不同的sigma值計算出來不同scale的圖。
Gaussian filtersigma取原來的平方時,就跳到另外一個octavesample rate減半(512*512 -> 256*256),所以圖就變小了。(不過我不懂為什麼當跳到另一層的octave sample就可以減半….)
將這些圖相減,就可以得到DoG值。

1.2extrema

這就蠻好懂的,就是找出與前後左右、上一張、下一張間最不相同的點,做為feature point

2.從上面取出的feature point中刪掉一些不stable的點。
stable的點: 對比低的點、可能是edge的點
利用泰勒展開式。公式不好貼,詳細就再去看paper....

3.找出主要的方向
對每一個feature point都去計算gradient的大小和方向。
利用orientation histogram

4. 找出description

pixel附近,8*8window切成2*2sub-window,每個sub-window去統計裡面4*4的大小跟方向,形成8維的方向。
這樣每個pixel就會有4*8 = 32的維度。


參考資料:
裡面有提到DoG,圖3從這邊取出來的。
Feature Matching CYY的投影片說明,講得很詳細。