machine learning - Adding new terms to a bag-of-words model
I'm using k-means clustering to group a set of news items. I'm using the bag-of-words model to represent the documents; more specifically, each document is represented by a term frequency vector.
My question: how can I add new documents without having to recalculate all the term frequency vectors (seeing as the vocabulary containing all the terms in the docs will change)?
The easy solution is to use only the vocabulary from the documents you've already seen, ignoring any new terms; this is customary in document classification.
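A minimal sketch of the fixed-vocabulary approach, in plain Python. The helper names `build_vocabulary` and `vectorize`, the whitespace tokenizer, and the sample documents are assumptions made for illustration, not part of the question:

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each term seen in the initial documents to a fixed column index."""
    vocabulary = {}
    for doc in documents:
        for term in doc.lower().split():  # naive whitespace tokenizer (assumption)
            if term not in vocabulary:
                vocabulary[term] = len(vocabulary)
    return vocabulary

def vectorize(doc, vocabulary):
    """Term-frequency vector over the fixed vocabulary; unseen terms are dropped."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(doc.lower().split()).items():
        index = vocabulary.get(term)
        if index is not None:
            vector[index] = count
    return vector

# Hypothetical sample data.
initial_docs = ["stocks fall as markets open", "markets rally as stocks rise"]
vocabulary = build_vocabulary(initial_docs)

# "after" and "election" were never seen, so they are ignored;
# the vector keeps the same length as the existing ones.
new_vector = vectorize("stocks rally after election", vocabulary)
```

Because the vocabulary never changes, vectors computed earlier stay valid and a new document can be compared against the existing cluster centroids directly.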
Another solution, one that's become popular in recent years, is to ditch the vocabulary altogether and use feature hashing.
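Feature hashing can be sketched roughly as follows. The vector size `N_FEATURES`, the use of MD5 as a stable hash, and the function name `hashed_vector` are assumptions chosen for illustration; Python's built-in `hash()` is avoided here because it is randomized per process for strings:

```python
import hashlib

N_FEATURES = 2 ** 10  # assumed vector size; larger values mean fewer collisions

def hashed_vector(doc, n_features=N_FEATURES):
    """Term-frequency vector where each term's column is derived from its hash."""
    vector = [0] * n_features
    for term in doc.lower().split():  # naive whitespace tokenizer (assumption)
        digest = hashlib.md5(term.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_features
        vector[index] += 1  # distinct terms may collide and share a column
    return vector

# No vocabulary is stored, so previously unseen terms need no special handling.
old_doc = hashed_vector("stocks fall as markets open")
new_doc = hashed_vector("stocks rally after election")
```

The trade-off is that distinct terms can land in the same column, and the mapping cannot be inverted to recover the terms. If scikit-learn is available, its `HashingVectorizer` is a ready-made implementation of the same idea.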
A third possibility is to reserve space in the feature vectors for future terms. E.g., let's say you're vectorizing a bunch of documents with a vocabulary of size n; you turn them into vectors of size n + k with the final k entries set to zero, so you can add k terms to the vocabulary later on.
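One way the reserved-space idea might look in code. The class name `PaddedVocabulary`, its `spare_slots` and `allow_new_terms` parameters, and the policy of dropping terms once the spare slots run out are all assumptions for this sketch:

```python
class PaddedVocabulary:
    """Vocabulary of n known terms plus k spare columns for future terms."""

    def __init__(self, documents, spare_slots):
        self.index = {}
        for doc in documents:
            for term in doc.lower().split():  # naive tokenizer (assumption)
                self.index.setdefault(term, len(self.index))
        # Every vector has length n + k from the start.
        self.size = len(self.index) + spare_slots

    def vectorize(self, doc, allow_new_terms=True):
        vector = [0] * self.size
        for term in doc.lower().split():
            if term not in self.index:
                # Claim a spare column if one is left; otherwise drop the term.
                if not allow_new_terms or len(self.index) >= self.size:
                    continue
                self.index[term] = len(self.index)
            vector[self.index[term]] += 1
        return vector

# Hypothetical sample data: n = 3 known terms, k = 2 spare columns.
vocab = PaddedVocabulary(["a b c"], spare_slots=2)
old_vector = vocab.vectorize("a b", allow_new_terms=False)
new_vector = vocab.vectorize("a d e f")  # "d" and "e" get the spare columns
```

Vectors built before the new terms arrived already have zeros in the spare columns, so they remain comparable with the new ones without being recomputed. Once all k columns are used, this sketch falls back to ignoring further new terms.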
(What is not a solution is to compute dot products, means, etc. on hash tables directly. It's a flexible approach, but it's very slow.)
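For reference, this is roughly what working on hash tables directly would mean: each document is a term-to-count dict, and the operations k-means needs are written by hand. The function names `dot` and `mean` and the sample documents are assumptions for illustration only:

```python
from collections import Counter

def dot(a, b):
    """Dot product of two term -> count dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(count * b.get(term, 0) for term, count in a.items())

def mean(vectors):
    """Component-wise mean of several term -> count dicts (e.g. a centroid)."""
    total = {}
    for vector in vectors:
        for term, count in vector.items():
            total[term] = total.get(term, 0) + count
    return {term: value / len(vectors) for term, value in total.items()}

# Hypothetical sample data.
doc1 = Counter("stocks fall stocks".split())
doc2 = Counter("stocks rise".split())

similarity = dot(doc1, doc2)
centroid = mean([doc1, doc2])
```

New terms need no bookkeeping at all here, which is the flexibility mentioned above, but every operation goes through per-term dictionary lookups instead of array arithmetic.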