Word clusters
Daniel Preotiuc-Pietro, 31 March 2015, danielpr@sas.upenn.edu

Word clusters obtained using different methods. The clusters are hard clusters (each word belongs to exactly one cluster). The importance score represents how central that word is within its cluster. Clusters are computed using spectral clustering over a word-word similarity matrix; the methods differ only in how the similarity is computed. The clusters are mainly useful as features in predictive tasks.

All files have one word per line, in the format:
word  importance  cluster_id

1. NPMI
The word similarity is the Normalised Pointwise Mutual Information [1] computed over a large corpus. Here, the corpus is 58 days of the 10% sample of the Twitter stream [2].

2. W2V
Similarity is the cosine between the two word embeddings. The embeddings are derived using Word2Vec [3] with a layer size of 50, and are learned on the same corpus as the NPMI matrix using Gensim [4].

3. GloVe
Similarity is the cosine between the word embeddings obtained using GloVe [5]. The embeddings are downloaded from [6], with a layer size of 200.

References:
[1] Normalised Pointwise Mutual Information - G. Bouma
[2] Predicting and Characterising User Impact on Twitter - V. Lampos, N. Aletras, D. Preotiuc-Pietro, T. Cohn
[3] Efficient Estimation of Word Representations in Vector Space - T. Mikolov, K. Chen, G. Corrado, J. Dean
[4] http://radimrehurek.com/gensim/models/word2vec.html
[5] GloVe: Global Vectors for Word Representation - J. Pennington, R. Socher, C. Manning
[6] http://nlp.stanford.edu/projects/glove/
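As a usage illustration, a sketch of reading the cluster files into a lookup usable as features. The function names are hypothetical and the parser assumes the three fields are whitespace-separated; adjust the split if the actual files use a different delimiter.

```python
def parse_cluster_lines(lines):
    """Parse lines of the form 'word importance cluster_id'
    (assumed whitespace-separated) into {word: (importance, cluster_id)}."""
    clusters = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        word, importance, cluster_id = parts
        clusters[word] = (float(importance), int(cluster_id))
    return clusters

def load_clusters(path):
    """Load a cluster file from disk."""
    with open(path, encoding="utf-8") as f:
        return parse_cluster_lines(f)

def cluster_members(clusters, cid):
    """List the words in cluster `cid`, most central first."""
    members = [(w, imp) for w, (imp, c) in clusters.items() if c == cid]
    return sorted(members, key=lambda x: -x[1])
```

For predictive tasks, the word-to-cluster_id mapping can then be used to replace each token by its cluster, with the importance score available as an optional weight.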
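For reference, the NPMI similarity of method 1 can be computed from corpus co-occurrence counts as in [1]. This is a minimal sketch of the formula only, not the exact counting pipeline used to build the released matrix:

```python
import math

def npmi(count_xy, count_x, count_y, n):
    """Normalised PMI of words x and y, given their co-occurrence count,
    individual counts, and the total number of observations n.
    Returns a value in [-1, 1]: 1 = always together, 0 = independent."""
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

NPMI normalises PMI by -log p(x, y), which bounds it to [-1, 1] and reduces PMI's bias toward rare words, making the resulting matrix better suited as input to spectral clustering.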
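Methods 2 and 3 both use cosine similarity between embedding vectors. A self-contained sketch of that computation (plain Python, independent of the Gensim or GloVe tooling):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors:
    dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

In practice the vectors would be the 50-dimensional Word2Vec or 200-dimensional GloVe embeddings; the pairwise cosine values form the similarity matrix fed to the spectral clustering.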