Wednesday, April 3, 2013

Applying K-means to CSV Files Using Python

What is K-means?

K-means is a simple way to cluster data whose features are already known. We call the input data entities "observations" and the output groups "clusters". In short, k-means assigns each of n observations to one of k clusters, placing each observation in the cluster whose center (centroid) is nearest.
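
The labeling described above can be sketched in a few lines of plain Python. This is only a toy version of the usual iterative scheme (assign each observation to its nearest centroid, then move each centroid to the mean of its members), not what scipy does internally. The function name kmeans_sketch, its parameters, and the sample points are made up for illustration.

```python
import math
import random

def kmeans_sketch(points, k, iterations=20, seed=0):
    """Toy k-means on a list of equal-length tuples (illustrative only)."""
    rng = random.Random(seed)
    # Start from k distinct observations picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]

    def nearest(p):
        # Index of the centroid closest to observation p.
        return min(range(k), key=lambda j: math.dist(p, centroids[j]))

    for _ in range(iterations):
        labels = [nearest(p) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the final centroids.
    return centroids, [nearest(p) for p in points]

# Two obvious groups of made-up 2-D observations.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans_sketch(points, k=2)
print(labels)
```

The first three points should share one label and the last three the other. Which group ends up as cluster 0 depends on the random start.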

What is CSV?

CSV (comma-separated values) is a plain-text format for storing tabular data. It can be generated by Excel, Google Forms, and similar tools. Thanks to its simple syntax, it is also easy to read and write from many programming languages.
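
To make the format concrete, here is a small sketch that parses CSV text with Python's built-in csv module. It assumes the layout used later in this post: the first column is a name and the remaining columns are numbers. The two sample rows are invented, and io.StringIO stands in for a real file so the snippet runs on its own.

```python
import csv
import io

# Made-up sample in the assumed layout: a name, then numeric features.
sample = io.StringIO("pizza,8.0,2.5\nsalad,1.5,9.0\n")

names, features = [], []
for row in csv.reader(sample):
    names.append(row[0])                          # first column: label
    features.append([float(x) for x in row[1:]])  # the rest: numbers

print(names)
print(features)
```

If the file has a header row, it would need to be skipped before converting values with float.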

What Am I Writing Today?

I am writing a program that reads a table from a CSV file, which may be generated by Excel or a Google Drive form, and applies the k-means algorithm to it. Finally, it prints the items in each cluster and plots the points on the screen.


#!/usr/bin/python

# This program reads data from a CSV file,
# applies k-means, and outputs the result.

from pylab            import plot,show
from numpy            import vstack
from scipy.cluster.vq import kmeans, vq, whiten

import csv

if __name__ == "__main__":

    # number of clusters
    K = 3

    data_arr = []
    meal_name_arr = []

    # open in text mode so csv.reader receives strings under Python 3
    with open('meals2.csv', 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            data_arr.append([float(x) for x in row[1:]])
            meal_name_arr.append([row[0]])

    data = vstack( data_arr )
    meal_name = vstack(meal_name_arr)

    # normalization
    data = whiten(data)

    # computing K-Means with K (clusters)
    centroids, distortion = kmeans(data, K)
    print("distortion = " + str(distortion))

    # assign each sample to a cluster
    idx,_ = vq(data,centroids)

    # plot the first two features of each cluster using numpy's boolean indexing
    # (the three colors below assume K = 3)
    plot(data[idx==0,0], data[idx==0,1],'ob',
         data[idx==1,0], data[idx==1,1],'or',
         data[idx==2,0], data[idx==2,1],'og')

    print(meal_name)
    print(data)

    for i in range(K):
        result_names = meal_name[idx==i, 0]
        print("=================================")
        print("Cluster " + str(i+1))
        for name in result_names:
            print(name)

    plot(centroids[:,0],
         centroids[:,1],
         'sg',markersize=8)

    show()
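
A note on the normalization step in the listing: as I understand the SciPy documentation, whiten rescales each feature (column) by dividing it by its standard deviation across all observations, so that no single feature dominates the distance computation. Here is a plain-Python sketch of that idea. The name whiten_sketch and the sample rows are made up, and the sketch assumes no column is constant (which would mean dividing by zero).

```python
import statistics

def whiten_sketch(rows):
    """Divide each column by its population standard deviation."""
    columns = list(zip(*rows))
    stds = [statistics.pstdev(col) for col in columns]
    return [[value / std for value, std in zip(row, stds)] for row in rows]

# Two made-up observations whose second feature is on a 10x larger scale.
rows = [[1.0, 10.0], [3.0, 30.0]]
scaled = whiten_sketch(rows)
print(scaled)
```

After scaling, both columns have a standard deviation of one, so the two features contribute on a comparable scale when k-means measures distances.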