machine learning - Mahout: RowSimilarity vs Clustering


I am trying to cluster documents using the k-means clustering approach and have created the clusters. I saved the cluster ID corresponding to each document for use in recommendations. Whenever I wanted to recommend documents similar to a particular document, I queried all documents in that document's cluster and returned n random documents from the cluster. However, returning random documents from the cluster did not seem appropriate, and I read somewhere that I should be returning the documents nearest to the document in question.

So I started searching for ways of calculating the distance between documents and stumbled upon the RowSimilarity approach, which returns the 10 most similar documents for each document, ordered by distance. This approach relies on a similarity metric such as log-likelihood to calculate the distance between documents.
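To make the "top N similar documents per document" idea concrete, here is a minimal sketch in plain Python. It is not Mahout's RowSimilarity job, and it uses cosine similarity on raw term counts instead of log-likelihood purely to keep the example short. The names `tokenize`, `cosine` and `top_n_similar` and the sample product titles are all made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer (assumption); a real pipeline would
    # also handle punctuation, stop words and term weighting.
    return text.lower().split()

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors (Counters).
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_n_similar(docs, n=10):
    # For every document id, return up to n other ids with their scores,
    # ordered by decreasing similarity (ties broken by id).
    vectors = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    result = {}
    for doc_id, vec in vectors.items():
        scored = [
            (cosine(vec, other), other_id)
            for other_id, other in vectors.items()
            if other_id != doc_id
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[doc_id] = [(other_id, score) for score, other_id in scored[:n]]
    return result

# Hypothetical product titles.
docs = {
    "p1": "red cotton t-shirt",
    "p2": "blue cotton t-shirt",
    "p3": "stainless steel kettle",
}
neighbours = top_n_similar(docs, n=2)
```

Note that this compares every pair of documents, so it only scales to small collections; a distributed pairwise-similarity job exists precisely to avoid doing that on one machine.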

Now my question is this: how is clustering better or worse than RowSimilarity, given that both approaches use a similarity or distance metric to calculate the distance between documents?

What I'm trying to achieve is to cluster products on the basis of their titles and other text properties in order to recommend similar products. Any help is appreciated.

Clustering is not a variant of classification or recommendation. It is a different discipline.

When doing cluster analysis, you want to discover structure in your data. Then you should be analyzing the structure you found.

Now, k-means is not meant for documents. It tries to find a near-optimal partitioning of the data set into k Voronoi cells. Unless you have reason to believe that Voronoi cells are a good partitioning of your data, the algorithm may be pretty useless. Just because it returns a result does not at all indicate that the result is useful.

For documents, Euclidean distance (and k-means is in fact optimizing Euclidean distances) is pretty meaningless. The vectors are sparse, and the k-means cluster centers will resemble impossible (and insensible) "average documents".
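A tiny sketch of what such an "average document" looks like, using made-up term-count vectors (the helper `centroid` and the toy data are assumptions for illustration, not anything taken from Mahout):

```python
def centroid(vectors):
    # Arithmetic mean of sparse vectors stored as {term: count} dicts,
    # which is how a k-means center is computed from its members.
    total = {}
    for vec in vectors:
        for term, count in vec.items():
            total[term] = total.get(term, 0.0) + count
    return {term: value / len(vectors) for term, value in total.items()}

# Four hypothetical product titles that ended up in the same cluster.
cluster = [
    {"red": 1, "cotton": 1, "shirt": 1},
    {"steel": 1, "kettle": 1},
    {"leather": 1, "wallet": 1},
    {"wireless": 1, "mouse": 1},
]
center = centroid(cluster)
max_doc_terms = max(len(vec) for vec in cluster)

print(len(center), "terms in the center vs at most", max_doc_terms, "in any document")
print(center)
```

The center contains a quarter of every word from every title, while no real document has more than three terms. It is not close to anything a person would recognise as a document, which is why the distance to it says little.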

And I haven't even started on the need to find an appropriate value of k, on the Mahout implementation being only an approximation of Lloyd's k-means approximation, and so on. Did you check the cluster sizes? In situations like these, k-means can produce degenerate results: for example, clusters containing 1 or 0 elements, and one mega-cluster containing the rest. In that situation, you might in fact be returning random documents from your database...
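Checking the cluster sizes takes only a few lines. The sketch below assumes you have the saved document-to-cluster mapping as a dict and that cluster IDs run from 0 to k-1. The function name `cluster_size_report` and the thresholds `tiny` and `mega_share` are arbitrary choices for illustration, not recommended values.

```python
from collections import Counter

def cluster_size_report(assignments, k, tiny=1, mega_share=0.5):
    # assignments: {document_id: cluster_id}, cluster ids assumed to be 0..k-1.
    # Returns the size of every cluster (including empty ones), the ids of
    # clusters with at most `tiny` members, and the ids of clusters holding
    # more than `mega_share` of all documents.
    counts = Counter(assignments.values())
    total = len(assignments)
    sizes = {cid: counts.get(cid, 0) for cid in range(k)}
    tiny_clusters = sorted(cid for cid, size in sizes.items() if size <= tiny)
    mega_clusters = sorted(
        cid for cid, size in sizes.items() if total and size / total > mega_share
    )
    return sizes, tiny_clusters, mega_clusters

# Hypothetical assignments for k=3: one huge cluster, one singleton, one empty.
assignments = {
    "d1": 0, "d2": 0, "d3": 0, "d4": 0,
    "d5": 0, "d6": 0, "d7": 1, "d8": 0,
}
sizes, tiny_clusters, mega_clusters = cluster_size_report(assignments, k=3)
print(sizes, tiny_clusters, mega_clusters)
```

If most documents land in one cluster, picking n random members of that cluster is barely different from picking n random documents overall.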

Just because you can use it does not mean it is helpful. Make sure to validate the individual steps of your approach, for example whether the clusters are in any way useful and sensible!

