machine learning - Adding new terms to a bag-of-words model


I'm using k-means clustering to group a set of news items. I'm using the bag-of-words model to represent the documents; more specifically, each document is represented by a term frequency vector.

My question: how can I add new documents without having to recalculate all the term frequency vectors (seeing as the vocabulary containing the terms of all the docs will change)?

The easy solution is to use the vocabulary of the documents you've seen so far, ignoring any new terms; this is customary in document classification.
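A minimal sketch of the fixed-vocabulary approach, assuming whitespace tokenization and lowercasing. The helper names `build_vocabulary` and `vectorize` and the sample headlines are made up for illustration:

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct term in the initial documents to a column index."""
    vocabulary = {}
    for doc in documents:
        for term in doc.lower().split():
            vocabulary.setdefault(term, len(vocabulary))
    return vocabulary

def vectorize(doc, vocabulary):
    """Term frequency vector over a fixed vocabulary; unseen terms are dropped."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(doc.lower().split()).items():
        index = vocabulary.get(term)
        if index is not None:  # terms outside the vocabulary are ignored
            vector[index] = count
    return vector

# Hypothetical sample data: the vocabulary is frozen after these two documents.
initial_docs = ["stocks fall on rate fears", "rate cut lifts stocks"]
vocabulary = build_vocabulary(initial_docs)

# "rally", "as", "inflation" and "fade" are not in the vocabulary and are skipped.
new_vector = vectorize("stocks rally as inflation fears fade", vocabulary)
```

The new document's vector has the same length as the old ones, so the existing vectors stay valid. The cost is that any information carried by the new terms is lost.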

Another solution, which has become popular in recent years, is to ditch the vocabulary altogether and use feature hashing.
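Feature hashing could look roughly like this. The vector size `N_FEATURES`, the use of MD5 as the hash function, and the function names are assumptions made for this sketch, not part of the original answer:

```python
import hashlib

N_FEATURES = 2 ** 10  # assumed vector size; larger values mean fewer collisions

def hash_index(term, n_features=N_FEATURES):
    """Map a term to a column index using a hash that is stable across runs."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features

def hashed_vector(doc, n_features=N_FEATURES):
    """Term frequency vector of fixed size, built without any vocabulary."""
    vector = [0] * n_features
    for term in doc.lower().split():
        vector[hash_index(term, n_features)] += 1
    return vector

# Any document, with any terms, maps to a vector of the same length.
old_vector = hashed_vector("rate cut lifts stocks")
new_vector = hashed_vector("stocks rally as inflation fears fade")
```

Since the column of a term depends only on the term itself, new documents never change the vectors of old ones. Two different terms can land in the same column (a collision), which adds some noise. Python's built-in `hash()` is avoided here because its values for strings differ between interpreter runs by default.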

A third possibility is to reserve space in the feature vectors for future terms. E.g., let's say you're vectorizing a bunch of documents with a vocabulary of size n; turn them into vectors of size n + k with the final k entries set to zero, so that you can add k terms to the vocabulary later on.
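The n + k idea can be sketched as follows. The class `PaddedVectorizer`, its `spare_slots` parameter (playing the role of k), and the choice to drop terms once the spare slots run out are all assumptions of this sketch:

```python
class PaddedVectorizer:
    """Term frequency vectors of size n + k, with k slots kept for new terms."""

    def __init__(self, documents, spare_slots):
        self.vocabulary = {}
        for doc in documents:
            for term in doc.lower().split():
                self.vocabulary.setdefault(term, len(self.vocabulary))
        # n known terms plus k reserved columns that start out unused.
        self.size = len(self.vocabulary) + spare_slots

    def transform(self, doc):
        vector = [0] * self.size
        for term in doc.lower().split():
            index = self.vocabulary.get(term)
            if index is None and len(self.vocabulary) < self.size:
                # Claim the next reserved column for this new term.
                index = len(self.vocabulary)
                self.vocabulary[term] = index
            if index is not None:  # no free slot left: the term is ignored
                vector[index] += 1
        return vector

# Hypothetical data: n = 4 known terms, k = 2 reserved columns.
vectorizer = PaddedVectorizer(["rate cut lifts stocks"], spare_slots=2)
old_vector = vectorizer.transform("rate cut")
new_vector = vectorizer.transform("stocks crash overnight worldwide")
```

Here "crash" and "overnight" take the two reserved columns, while "worldwide" arrives after the space is used up and is dropped. Vectors computed earlier stay valid because their reserved columns were already zero.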

(What is not a solution is to compute dot products, means, etc. on hash tables directly. That's a flexible approach, but it's very slow.)
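For reference, operating on hash tables directly means something like the sketch below, where each document is a term-to-count mapping. The function names `sparse_dot` and `sparse_mean` are hypothetical. It is shown only to make the discouraged approach concrete:

```python
from collections import Counter

def sparse_dot(a, b):
    """Dot product of two term -> count mappings."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller mapping
    return sum(count * b.get(term, 0) for term, count in a.items())

def sparse_mean(vectors):
    """Component-wise mean of several term -> count mappings."""
    total = Counter()
    for vector in vectors:
        total.update(vector)  # adds counts term by term
    return {term: count / len(vectors) for term, count in total.items()}

doc_a = Counter("rate cut lifts stocks".split())
doc_b = Counter("stocks fall on rate fears".split())

similarity = sparse_dot(doc_a, doc_b)    # shared terms: "rate" and "stocks"
centroid = sparse_mean([doc_a, doc_b])   # a k-means centroid as a dict
```

No vocabulary is needed and new terms are handled for free, but every operation goes through per-term dictionary lookups in the interpreter instead of array arithmetic.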

