Hi,
Thanks for the great tutorial on document clustering. I am pretty new to text analytics and wanted to ask whether there is a reason that distances are calculated twice for hierarchical document clustering?
First, here on the `tfidf_matrix`, using cosine distance:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
and a second time here, on `dist`, through the `ward` function, which computes Euclidean distances before doing the Ward linkage:
linkage_matrix = ward(dist)
Is this something done specifically for text clustering?
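To make the question concrete, here is a minimal sketch of what I think is happening. It assumes `ward` comes from `scipy.cluster.hierarchy` (which treats a 2-D input as observation vectors and a 1-D input as a condensed distance matrix), and it uses a small random matrix as a hypothetical stand-in for the real `tfidf_matrix`:

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import ward
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the tutorial's TF-IDF matrix (6 documents, 4 terms).
rng = np.random.default_rng(0)
tfidf_matrix = rng.random((6, 4))

# Step 1: cosine distance between documents (square 6x6 matrix).
dist = 1 - cosine_similarity(tfidf_matrix)

with warnings.catch_warnings():
    # SciPy may warn that a square 2-D input looks like an uncondensed
    # distance matrix; silence that here to keep the output readable.
    warnings.simplefilter("ignore")
    # Step 2 as in the tutorial: a 2-D input is treated as observation
    # vectors, so Euclidean distances are computed between the ROWS of dist.
    linkage_2d = ward(dist)

# The same thing written out explicitly: Euclidean pdist over rows of dist.
linkage_rows = ward(pdist(dist, metric="euclidean"))

# Alternative: pass the cosine distances themselves in condensed (1-D) form,
# so no second distance computation takes place.
dist_clean = (dist + dist.T) / 2      # enforce exact symmetry
np.fill_diagonal(dist_clean, 0.0)     # remove floating-point noise on the diagonal
linkage_condensed = ward(squareform(dist_clean, checks=False))

same_as_rows = np.allclose(linkage_2d, linkage_rows)
print("ward(dist) matches ward(pdist(dist)):", same_as_rows)
```

If that is right, then `ward(dist)` clusters on Euclidean distances between rows of the cosine-distance matrix, while the condensed version clusters on the cosine distances directly. Note that Ward linkage is formally defined for Euclidean distances, so I am not sure which of the two the tutorial intends.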
Thanks again