What is K-means?
K-means is a simple way to cluster data whose features are already known. We call the input data entities "observations" and the output groups "clusters". Each observation is assigned to the cluster whose centroid (mean point) is nearest to it. In this post, k-means is used to label n observations into k clusters.
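To make the idea concrete, here is a minimal sketch of the usual assign-then-update loop, written with plain numpy. The function name `kmeans_sketch`, its parameters, and the sample points are all made up for illustration; this is not how scipy implements it, only a toy version under those assumptions.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Toy k-means (hypothetical helper): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each observation with its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its observations.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Made-up observations: two obvious groups of three points each.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids, labels = kmeans_sketch(points, k=2)
print(labels)
```

The first three points should end up sharing one label and the last three the other. Which group gets label 0 depends on the random starting centroids.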
What is CSV?
CSV (comma-separated values) is a plain-text format for storing tabular data. It can be exported from Excel, Google Forms, and similar tools. Because its syntax is simple, it is also easy to read and write from many programming languages.
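As a small illustration, the sketch below parses CSV text with Python's standard `csv` module. The rows are invented sample data (a name followed by two numbers), and the text is read from an in-memory string rather than a real file so the example runs on its own.

```python
import csv
import io

# Hypothetical sample: a meal name followed by two numeric features.
sample = "pizza,285,12.0\nsalad,120,3.5\nsteak,679,62.0\n"

names, features = [], []
for row in csv.reader(io.StringIO(sample)):
    names.append(row[0])                           # first column: the label
    features.append([float(x) for x in row[1:]])   # the rest: numbers

print(names)
print(features)
```

Replacing `io.StringIO(sample)` with an opened file object reads the same layout from disk.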
What Am I Writing Today?
I am writing a program that reads a table from a CSV file, which may be generated by Excel or a Google Drive form, and applies the k-means algorithm to it. Finally, it prints the items in each cluster and draws the points on the screen.
#!/usr/bin/env python3
# This program reads data from a CSV file,
# applies k-means, then outputs the result.
import csv

from numpy import vstack
from pylab import plot, show
from scipy.cluster.vq import kmeans, vq, whiten

if __name__ == "__main__":
    # number of clusters
    K = 3
    data_arr = []
    meal_name_arr = []
    # the first column is the meal name, the remaining columns are numeric features
    with open('meals2.csv', newline='') as f:
        reader = csv.reader(f)
        for row in reader:
            data_arr.append([float(x) for x in row[1:]])
            meal_name_arr.append([row[0]])
    data = vstack(data_arr)
    meal_name = vstack(meal_name_arr)
    # normalization: scale each feature to unit variance
    data = whiten(data)
    # computing K-Means with K clusters
    centroids, distortion = kmeans(data, K)
    print("distortion = " + str(distortion))
    # assign each sample to a cluster
    idx, _ = vq(data, centroids)
    # plot the first two features, one colour per cluster,
    # using numpy's logical indexing (the colours assume K = 3)
    plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
         data[idx == 1, 0], data[idx == 1, 1], 'or',
         data[idx == 2, 0], data[idx == 2, 1], 'og')
    print(meal_name)
    print(data)
    for i in range(K):
        result_names = meal_name[idx == i, 0]
        print("=================================")
        print("Cluster " + str(i + 1))
        for name in result_names:
            print(name)
    # mark the centroids with green squares
    plot(centroids[:, 0],
         centroids[:, 1],
         'sg', markersize=8)
    show()
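The normalization and assignment steps are easier to follow on a tiny example. The sketch below uses made-up numbers and hand-picked centroids (an assumption for illustration only; in the program above the centroids come from `kmeans`). It shows that `whiten` divides each column by its standard deviation, and that `vq` returns the index of the nearest centroid for every observation.

```python
import numpy as np
from scipy.cluster.vq import vq, whiten

# Hypothetical features on very different scales.
raw = np.array([[100.0, 1.0],
                [200.0, 2.0],
                [900.0, 9.0],
                [1000.0, 10.0]])

scaled = whiten(raw)
# Same scaling done by hand: divide each column by its standard deviation.
manual = raw / raw.std(axis=0)
print(np.allclose(scaled, manual))

# Two hand-picked centroids in the scaled space, for illustration only.
centroids = np.array([scaled[:2].mean(axis=0), scaled[2:].mean(axis=0)])
labels, distances = vq(scaled, centroids)
print(labels)
```

Without the scaling step, the column with the larger numbers would dominate the distance calculation, which is presumably why the program whitens the data before clustering.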