.. title: Belajar KMeans Menggunakan Python .. slug: .. date: 2019-05-5 .. tags: KMeans, Python, Clustering, sklearn .. category: Data Aanalysis .. link: .. description: Klasterisasi data menggunakan Python .. type: text .. figure:: https://www.kcet.org/sites/kl/files/atoms/article_atoms/www.kcet.org/living/homegarden/firewood-01.jpg :target: https://www.kcet.org/sites/kl/files/atoms/article_atoms/www.kcet.org/living/homegarden/firewood-01.jpg
Photo by Horia Varlan/Flickr/Creative Commons License
Apa itu KMeans? ###########
KMeans merupakan metode klasterisasi yang mudah dan sederhana. Klasterisasi KMeans didasarkan pada kedekatan data. Jarak antar titik data dapat diukur menggunakan metode Eucledian, Mnhattan atau Minkowski. Berikut adalah klasterisasi KMeans menggunkan Python dengan menggunakan dataset bunga Iris.
Dataset Bunga Iris?
High level conceptual information (Heading 2).
At a minimum, a concept includes the following components.
- A title, phrased as a gerund or question.
- One or more body paragraphs.
Complex concepts may contain 2 or more subsections.
When you need to break down a subject, you can break it down into subsections (H3s). Typically you would have 0 H3s, or 2+ H3s.
When you need to break down a subject, you can break it down into subsections (H3s)
Do this
A task typically follows conceptual information. Task titles should be imperative. Tasks should have a short introduction sentence that captures the user's goal and introduces the steps, for example, "Verify your products are in the catalog:"
A task should have 3 - 7 steps. Tasks with more should be broken down into digestible chunks.
Intro sentence.
#. Step 1.
#. Step 2.
#. Step 3.
Following the steps, you should add the result and any follow-up tasks needed.
.. code:: ipython3
from sklearn.cluster import KMeans
.. code:: ipython3
import pandas as pd
.. code:: ipython3
df=pd.read_csv('iris.data', names=['sepal_length','sepal_width','petal_length','petal_width','class'])
.. code:: ipython3
df.head()
.. raw:: html
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>sepal_length</th>
<th>sepal_width</th>
<th>petal_length</th>
<th>petal_width</th>
<th>class</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>5.1</td>
<td>3.5</td>
<td>1.4</td>
<td>0.2</td>
<td>Iris-setosa</td>
</tr>
<tr>
<th>1</th>
<td>4.9</td>
<td>3.0</td>
<td>1.4</td>
<td>0.2</td>
<td>Iris-setosa</td>
</tr>
<tr>
<th>2</th>
<td>4.7</td>
<td>3.2</td>
<td>1.3</td>
<td>0.2</td>
<td>Iris-setosa</td>
</tr>
<tr>
<th>3</th>
<td>4.6</td>
<td>3.1</td>
<td>1.5</td>
<td>0.2</td>
<td>Iris-setosa</td>
</tr>
<tr>
<th>4</th>
<td>5.0</td>
<td>3.6</td>
<td>1.4</td>
<td>0.2</td>
<td>Iris-setosa</td>
</tr>
</tbody>
</table>
</div>
.. code:: ipython3
df.to_csv('iris.csv')
.. code:: ipython3
df.describe()
.. raw:: html
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>sepal_length</th>
<th>sepal_width</th>
<th>petal_length</th>
<th>petal_width</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>150.000000</td>
<td>150.000000</td>
<td>150.000000</td>
<td>150.000000</td>
</tr>
<tr>
<th>mean</th>
<td>5.843333</td>
<td>3.054000</td>
<td>3.758667</td>
<td>1.198667</td>
</tr>
<tr>
<th>std</th>
<td>0.828066</td>
<td>0.433594</td>
<td>1.764420</td>
<td>0.763161</td>
</tr>
<tr>
<th>min</th>
<td>4.300000</td>
<td>2.000000</td>
<td>1.000000</td>
<td>0.100000</td>
</tr>
<tr>
<th>25%</th>
<td>5.100000</td>
<td>2.800000</td>
<td>1.600000</td>
<td>0.300000</td>
</tr>
<tr>
<th>50%</th>
<td>5.800000</td>
<td>3.000000</td>
<td>4.350000</td>
<td>1.300000</td>
</tr>
<tr>
<th>75%</th>
<td>6.400000</td>
<td>3.300000</td>
<td>5.100000</td>
<td>1.800000</td>
</tr>
<tr>
<th>max</th>
<td>7.900000</td>
<td>4.400000</td>
<td>6.900000</td>
<td>2.500000</td>
</tr>
</tbody>
</table>
</div>
.. code:: ipython3
import seaborn as sb
.. code:: ipython3
g=sb.pairplot(df, hue='class')
g.map_lower(sb.kdeplot)
g.map_diag(sb.kdeplot)
.. parsed-literal::
<seaborn.axisgrid.PairGrid at 0xc30557f470>
.. image:: output_7_1.png
.. code:: ipython3
df.corr()
.. raw:: html
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>sepal_length</th>
<th>sepal_width</th>
<th>petal_length</th>
<th>petal_width</th>
</tr>
</thead>
<tbody>
<tr>
<th>sepal_length</th>
<td>1.000000</td>
<td>-0.109369</td>
<td>0.871754</td>
<td>0.817954</td>
</tr>
<tr>
<th>sepal_width</th>
<td>-0.109369</td>
<td>1.000000</td>
<td>-0.420516</td>
<td>-0.356544</td>
</tr>
<tr>
<th>petal_length</th>
<td>0.871754</td>
<td>-0.420516</td>
<td>1.000000</td>
<td>0.962757</td>
</tr>
<tr>
<th>petal_width</th>
<td>0.817954</td>
<td>-0.356544</td>
<td>0.962757</td>
<td>1.000000</td>
</tr>
</tbody>
</table>
</div>
.. code:: ipython3
corr=df.corr()
sb.set(font_scale=1.25)
sb.heatmap(corr, square=True, annot=True, fmt='.2g', cmap='viridis', linewidths=1)
.. parsed-literal::
<matplotlib.axes._subplots.AxesSubplot at 0xc3068b06a0>
.. image:: output_9_1.png
.. code:: ipython3
sb.set(font_scale=1.4)
sb.clustermap(corr, square=True, annot=True, fmt='.2g', cmap='viridis', linewidths=1)
.. parsed-literal::
<seaborn.matrix.ClusterGrid at 0xc3075c4f28>
.. image:: output_10_1.png
.. code:: ipython3
X=df.iloc[:,0:4]
X.head()
.. raw:: html
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>sepal_length</th>
<th>sepal_width</th>
<th>petal_length</th>
<th>petal_width</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>5.1</td>
<td>3.5</td>
<td>1.4</td>
<td>0.2</td>
</tr>
<tr>
<th>1</th>
<td>4.9</td>
<td>3.0</td>
<td>1.4</td>
<td>0.2</td>
</tr>
<tr>
<th>2</th>
<td>4.7</td>
<td>3.2</td>
<td>1.3</td>
<td>0.2</td>
</tr>
<tr>
<th>3</th>
<td>4.6</td>
<td>3.1</td>
<td>1.5</td>
<td>0.2</td>
</tr>
<tr>
<th>4</th>
<td>5.0</td>
<td>3.6</td>
<td>1.4</td>
<td>0.2</td>
</tr>
</tbody>
</table>
</div>
.. code:: ipython3
Y=df['class']
Y.head()
.. parsed-literal::
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
Name: class, dtype: object
.. code:: ipython3
from sklearn.preprocessing import LabelEncoder
.. code:: ipython3
lbe=LabelEncoder().fit(Y)
.. code:: ipython3
Y=lbe.transform(Y)
.. code:: ipython3
Y
.. parsed-literal::
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
.. code:: ipython3
kmeans=KMeans(n_clusters=3)
.. code:: ipython3
kmeans.fit(X)
.. parsed-literal::
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
.. code:: ipython3
y_kmeans=kmeans.predict(X)
.. code:: ipython3
import matplotlib.pyplot as plt
.. code:: ipython3
centers=kmeans.cluster_centers_
.. code:: ipython3
plt.scatter(X.iloc[:,1], X.iloc[:,2], c=y_kmeans, s=100, cmap='viridis')
plt.scatter(centers[:,1], centers[:,2], c='black', s=2500, alpha=0.75)
plt.grid(False)
.. image:: output_22_0.png
.. code:: ipython3
from sklearn.metrics import confusion_matrix
.. code:: ipython3
confusion_matrix(Y, y_kmeans)
.. parsed-literal::
array([[ 0, 50, 0],
[ 2, 0, 48],
[36, 0, 14]], dtype=int64)
.. code:: ipython3
conf=confusion_matrix(Y, y_kmeans)
sb.set(font_scale=1.25)
sb.heatmap(conf.T, square=True, annot=True, fmt='.2g', cmap='viridis', linewidths=1)
plt.xlabel('Predicted Value')
plt.ylabel('True Value')
.. parsed-literal::
Text(78.90000000000006, 0.5, 'True Value')
.. image:: output_25_1.png
.. code:: ipython3
from sklearn.metrics import accuracy_score
accuracy_score(Y, y_kmeans)
.. parsed-literal::
0.09333333333333334
Adjusted Mutual Information
Metrik ini merupakan pengukuran simetris: perbandingan label_true dengan
label_pred akan menghasilkan nilai skor yang sama. Ini bisa bermanfaat
untuk mengukur kesesuaian dua label independen pada dataset yang sama
ketika label yang sebenarnya tidak diketahui.
.. code:: ipython3
from sklearn.metrics import adjusted_mutual_info_score
print(adjusted_mutual_info_score(Y, y_kmeans, average_method='arithmetic'))
print("Skore mendekati nilai 1 untuk klaster yang identik.")
.. parsed-literal::
0.7551191675800482
Skore mendekati nilai 1 untuk klaster yang identik.
Rand index adjusted
~~~~~~~~~~~~~~~~~~~
.. code:: ipython3
from sklearn.metrics import adjusted_rand_score
print(adjusted_rand_score(Y, y_kmeans))
print("Skore mendekati nilai 1 untuk klaster yang identik.")
.. parsed-literal::
0.7302382722834697
Skore mendekati nilai 1 untuk klaster yang identik.
Calinski and Harabaz score
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: ipython3
from sklearn.metrics import calinski_harabaz_score
print(calinski_harabaz_score(X, y_kmeans))
print("Skore menunjukkan rasio dispersi dalam klaster terhadap dispersi antar klaster.")
.. parsed-literal::
560.3999242466401
Skore menunjukkan rasio dispersi dalam klaster terhadap dispersi antar klaster.
Davies-Bouldin score
~~~~~~~~~~~~~~~~~~~~
.. code:: ipython3
from sklearn.metrics import davies_bouldin_score
print(davies_bouldin_score(X, y_kmeans))
print("Skore menunjukkan rasio jarak dalam klaster terhadap jarak antar klaster.")
.. parsed-literal::
0.6623228649898758
Skore menunjukkan rasio jarak dalam klaster terhadap jarak antar klaster.
.. parsed-literal::
C:\WPy64-3720\python-3.7.2.amd64\lib\site-packages\sklearn\metrics\cluster\unsupervised.py:342: RuntimeWarning: divide by zero encountered in true_divide
score = (intra_dists[:, None] + intra_dists) / centroid_distances
Completeness Score
~~~~~~~~~~~~~~~~~~
.. code:: ipython3
from sklearn.metrics import completeness_score
print(completeness_score(Y, y_kmeans))
print("*Completeness* menunjukkan apakah semua titik data untuk kelompok yang sama merupakan anggota klaster yang sama.")
.. parsed-literal::
0.7649861514489815
*Completeness* menunjukkan apakah semua titik data untuk kelompok yang sama merupakan anggota klaster yang sama.
Contingency Matrix
~~~~~~~~~~~~~~~~~~
.. code:: ipython3
from sklearn.metrics.cluster import contingency_matrix
print(contingency_matrix(Y, y_kmeans,eps=None, sparse=False))
.. parsed-literal::
[[ 0 50 0]
[ 2 0 48]
[36 0 14]]
Matriks antara nilai sebenarnya dengan nilai prediksi. Matrix ini serupa
dengan *Confusion Matrix*
Fowlkes Mallows Score
~~~~~~~~~~~~~~~~~~~~~
.. code:: ipython3
from sklearn.metrics import fowlkes_mallows_score
print(fowlkes_mallows_score(Y, y_kmeans))
.. parsed-literal::
0.8208080729114153
Skore bernilai antara 0 sampai dengan 1. Semakin tinggi nilai semakin
bagus tingkat kesamaan antara nilai sebenarnya dengan nilai prediksi.
Homogeneity and Completeness
.. code:: ipython3
from sklearn.metrics import homogeneity_completeness_v_measure
homogeneity_completeness_v_measure(Y, y_kmeans)
.. parsed-literal::
(0.7514854021988338, 0.7649861514489815, 0.7581756800057784)
Mempunyai nilai antara 0.0 sampai dengan 1.0. Nilai yang baik adalah nilai yang mendekati 1.0
Homogeneity Score
.. code:: ipython3
from sklearn.metrics import homogeneity_score
homogeneity_score(Y, y_kmeans)
.. parsed-literal::
0.7514854021988338
Skor antara 0.0 sampai dengan 1.0. Skor 1.0 menunjukkan labelisasi
homogen
Mutual Info Score
.. code:: ipython3
from sklearn.metrics import mutual_info_score
mutual_info_score(Y, y_kmeans, contingency=None)
.. parsed-literal::
0.8255910976103356
Nilai yang didasarkan pada perhitungan contingency matrix
Normalized Mutual Info Score
.. code:: ipython3
from sklearn.metrics import normalized_mutual_info_score
normalized_mutual_info_score(Y, y_kmeans, average_method='arithmetic')
.. parsed-literal::
0.7581756800057784
Skor antara 0.0 sampai dengan 1.0. Skor 1.0 menunjukkan labelisasi total
Silhouette Score
~~~~~~~~~~~~~~~~
.. code:: ipython3
from sklearn.metrics import silhouette_score
print(silhouette_score(X, y_kmeans))
.. parsed-literal::
0.5525919445499756
Nilai paling bagus adalah 1 dan nilai paling jelek adalah -1. Nilai
mendekati 0 menunjukkan klaster yang tumpang tindih. Nilai negatif
menunjukkan bahwa sampel dikelompokkan ke dalam klaster yang salah
Silhouette Samples
~~~~~~~~~~~~~~~~~~
.. code:: ipython3
from sklearn.metrics import silhouette_samples
print(silhouette_samples(X, y_kmeans))
.. parsed-literal::
[0.85157298 0.817887 0.83008729 0.8065908 0.84699565 0.74628444
0.8210796 0.85340748 0.75384818 0.82895302 0.80150542 0.83563957
0.81325176 0.74707696 0.70091086 0.64149392 0.77354754 0.84964579
0.70548523 0.8178354 0.78413148 0.8237893 0.79157875 0.79422255
0.77521625 0.80130908 0.8329435 0.84096492 0.84314169 0.81915866
0.81735915 0.79854746 0.76017812 0.71993736 0.82895302 0.83285788
0.79335138 0.82895302 0.7698255 0.84989778 0.84788216 0.6413782
0.78707116 0.7991425 0.74523195 0.81162359 0.81106264 0.8198735
0.81643527 0.85237895 0.02672203 0.38118643 0.05340075 0.59294381
0.36885321 0.59221025 0.28232583 0.26365142 0.34419223 0.57829491
0.3733641 0.58710354 0.55107857 0.48216686 0.56268236 0.32459291
0.55751057 0.61072967 0.46149897 0.6115753 0.32909528 0.58968904
0.31046301 0.49424779 0.5000461 0.38548959 0.12629433 0.11798213
0.55293611 0.50620254 0.59466094 0.56000896 0.61972579 0.26087292
0.54077013 0.41598629 0.16655431 0.48935747 0.60716023 0.61436443
0.59560929 0.50352722 0.62444848 0.29200997 0.62754454 0.60657448
0.62205599 0.55780204 0.13937138 0.63064081 0.49927538 0.23225278
0.61193633 0.36075942 0.5577792 0.54384277 0.46682151 0.55917348
0.44076207 0.56152256 0.26062588 0.22965423 0.55509948 0.28503067
0.02635881 0.39825264 0.42110831 0.49486598 0.48341063 0.32868889
0.6070348 0.33355947 0.51237366 0.20297372 0.580154 0.57818326
0.30904249 0.25226992 0.45434264 0.51608826 0.56017398 0.48442397
0.46255248 0.13900039 0.05328614 0.55186784 0.45549975 0.3887791
0.35124673 0.53444618 0.5702338 0.41025549 0.23225278 0.61324746
0.5670778 0.42513648 0.10417086 0.31493016 0.35245379 0.18544229]
Nilai paling bagus adalah 1 dan nilai paling jelek adalah -1. Nilai
mendekati 0 menunjukkan klaster yang tumpang tindih.
v_measure Score
~~~~~~~~~~~~~~~
.. code:: ipython3
from sklearn.metrics import v_measure_score
print(v_measure_score(Y, y_kmeans))
.. parsed-literal::
0.7581756800057784
Skor antara 0.0 sampai dengan 1.0. Skor 1.0 menunjukkan labelisasi total
.. code:: ipython3
from sklearn.cluster import k_means
k_means(X, n_clusters=3, sample_weight=None, init='k-means++', precompute_distances='auto', n_init=10, max_iter=300, verbose=False, tol=0.0001, random_state=None, copy_x=True, n_jobs=None, algorithm='auto', return_n_iter=True)
.. parsed-literal::
(array([[5.006 , 3.418 , 1.464 , 0.244 ],
[5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
[6.85 , 3.07368421, 5.74210526, 2.07105263]]),
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2,
2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2,
2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1]),
78.94084142614602,
6)
Fungsi k_means di atas akan memberikan *array* berisi *Centroid*
*Array* berisi *Label*, *Inertia* dan jumlah *Iterasi*