optimization - Optimizing clustering in Python
I wrote my own clustering algorithm (bad, I know) for my problem. It works well, but it could work faster.
The algorithm takes a list of values (1D) as input, and works like this:
- for each cluster, calculate the distance to its closest neighbor cluster
- select the cluster a that has the smallest distance to its neighbor b
- if the distance between a and b is not less than the threshold, return
- combine a and b
- go back to the first step
I probably reinvented the wheel here.
This is my brute-force code. How can I make it faster? I have SciPy and NumPy installed, in case there is something ready-made.
    # cluster center is a simple average value
    def cluster_center(cluster):
        return sum(cluster) / len(cluster)

    # distance between clusters
    def cluster_distance(a, b):
        return abs(cluster_center(a) - cluster_center(b))

    # clusters is a list of lists of values, defined before this loop
    while True:
        cluster_distances = []
        # if there is nothing left to cluster, we are ready
        if len(clusters) < 2:
            break
        # go through all clusters and calculate the shortest distance to a neighbor
        for cluster in clusters:
            # identity test, so clusters with equal contents still count as neighbors
            nearest = min(((cluster_distance(cluster, c), c)
                           for c in clusters if c is not cluster),
                          key=lambda pair: pair[0])
            cluster_distances.append((cluster, nearest))
        # find out the closest pair
        cluster_distances.sort(key=lambda entry: entry[1][0])
        # check if the distance is under the threshold of 15
        if cluster_distances[0][1][0] < 15:
            a = cluster_distances[0][0]
            b = cluster_distances[0][1][1]
            # combine the clusters (combine the lists)
            a.extend(b)
            # form the new cluster list without b
            clusters = [c[0] for c in cluster_distances if c[0] is not b]
        else:
            break
Usually, the term "cluster analysis" is used for multivariate partitions, because in 1D you can sort your data and solve many of these problems in an easier way.
So to speed up your approach, sort your data! And reconsider what you need to do.
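One way to read the sorting advice, as a rough sketch rather than a definitive implementation: once the values are sorted, every cluster is a contiguous run, so its nearest neighbor by center is always the cluster directly to its left or right. That removes the all-pairs comparison from the question's loop. The function name `cluster_sorted` and its `threshold` parameter are made up for this illustration.

```python
def cluster_sorted(values, threshold):
    """Merge 1D values into clusters, comparing only neighbors in sorted order."""
    # Start with one single-value cluster per input value, in sorted order.
    clusters = [[v] for v in sorted(values)]
    while len(clusters) > 1:
        centers = [sum(c) / len(c) for c in clusters]
        # Distance from each cluster center to the next center on its right.
        gaps = [centers[i + 1] - centers[i] for i in range(len(centers) - 1)]
        # Index of the closest adjacent pair.
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] >= threshold:
            break
        # Merge cluster i+1 into cluster i; the list stays sorted.
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


print(cluster_sorted([1, 2, 3, 50, 52, 100], threshold=15))
# [[1, 2, 3], [50, 52], [100]]
```

Each pass is a linear scan, so this is still quadratic in the worst case, but it avoids sorting all pairwise distances on every iteration.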
As for a more advanced method: do kernel density estimation, and look for local minima to use as splitting points.
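A minimal sketch of the density idea, under some assumptions: it uses a hand-rolled Gaussian kernel with a fixed `bandwidth` evaluated on an evenly spaced grid, and treats grid points where the density dips below both neighbors as split points. The names `kde_split`, `bandwidth` and `grid_size` are hypothetical, and the result depends heavily on the bandwidth you pick. With SciPy installed, `scipy.stats.gaussian_kde` could supply the density estimate instead of the manual sum.

```python
import math


def kde_split(values, bandwidth, grid_size=200):
    """Split 1D values at local minima of a Gaussian kernel density estimate."""
    data = sorted(values)
    if not data:
        return []
    lo, hi = data[0], data[-1]
    if lo == hi:
        return [data]
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    # Unnormalized density: one Gaussian bump per data point, summed on the grid.
    density = [
        sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in data)
        for x in grid
    ]
    # Interior grid points lower than the left neighbor and not higher than
    # the right neighbor are taken as split points.
    splits = [
        grid[i]
        for i in range(1, grid_size - 1)
        if density[i] < density[i - 1] and density[i] <= density[i + 1]
    ]
    clusters, current, k = [], [], 0
    for v in data:
        # Close the current cluster each time a split point is passed.
        while k < len(splits) and v > splits[k]:
            if current:
                clusters.append(current)
                current = []
            k += 1
        current.append(v)
    clusters.append(current)
    return clusters


print(kde_split([1, 2, 3, 50, 52, 100], bandwidth=5))
```

A smaller bandwidth produces more, tighter clusters; a larger one smooths neighboring groups together.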