KMeans

.. title: Learning KMeans with Python
.. slug:
.. date: 2019-05-5
.. tags: KMeans, Python, Clustering, sklearn
.. category: Data Analysis
.. link:
.. description: Clustering data with Python
.. type: text

.. figure:: https://www.kcet.org/sites/kl/files/atoms/article_atoms/www.kcet.org/living/homegarden/firewood-01.jpg
   :target: https://www.kcet.org/sites/kl/files/atoms/article_atoms/www.kcet.org/living/homegarden/firewood-01.jpg

   Photo by Horia Varlan/Flickr/Creative Commons License

What is KMeans?
###############

KMeans is a simple, easy-to-use clustering method. It groups data points by how close they are to one another. The distance between data points can be measured with the Euclidean, Manhattan, or Minkowski metric; the scikit-learn implementation used here relies on Euclidean distance. The walkthrough below applies KMeans clustering in Python to the Iris flower dataset.
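As a small illustration of the three distance measures named above, the sketch below computes them for the first two Iris rows, (5.1, 3.5, 1.4, 0.2) and (4.9, 3.0, 1.4, 0.2). The helper names ``minkowski``, ``manhattan`` and ``euclidean`` are hypothetical and use only the standard library; they are not part of scikit-learn.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski with p = 1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance (Minkowski with p = 2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# First two rows of the Iris measurements
row0 = (5.1, 3.5, 1.4, 0.2)
row1 = (4.9, 3.0, 1.4, 0.2)

print(euclidean(row0, row1))
print(manhattan(row0, row1))
print(minkowski(row0, row1, 3))
```

Manhattan and Euclidean distance are the special cases ``p = 1`` and ``p = 2`` of the Minkowski distance, which is why a single parameterised function can cover all three.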

The Iris Flower Dataset
~~~~~~~~~~~~~~~~~~~~~~~


.. code:: ipython3

    from sklearn.cluster import KMeans

.. code:: ipython3

    import pandas as pd

.. code:: ipython3

    # iris.data has no header row, so the column names are supplied explicitly
    df = pd.read_csv('iris.data', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])

.. code:: ipython3

    df.head()

=====  ============  ===========  ============  ===========  ===========
\      sepal_length  sepal_width  petal_length  petal_width  class
=====  ============  ===========  ============  ===========  ===========
0      5.1           3.5          1.4           0.2          Iris-setosa
1      4.9           3.0          1.4           0.2          Iris-setosa
2      4.7           3.2          1.3           0.2          Iris-setosa
3      4.6           3.1          1.5           0.2          Iris-setosa
4      5.0           3.6          1.4           0.2          Iris-setosa
=====  ============  ===========  ============  ===========  ===========

.. code:: ipython3

    df.to_csv('iris.csv')

.. code:: ipython3

    df.describe()

=====  ============  ===========  ============  ===========
\      sepal_length  sepal_width  petal_length  petal_width
=====  ============  ===========  ============  ===========
count  150.000000    150.000000   150.000000    150.000000
mean   5.843333      3.054000     3.758667      1.198667
std    0.828066      0.433594     1.764420      0.763161
min    4.300000      2.000000     1.000000      0.100000
25%    5.100000      2.800000     1.600000      0.300000
50%    5.800000      3.000000     4.350000      1.300000
75%    6.400000      3.300000     5.100000      1.800000
max    7.900000      4.400000     6.900000      2.500000
=====  ============  ===========  ============  ===========

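.. code:: ipython3

    import seaborn as sb

.. code:: ipython3

    # Pairwise scatter plots coloured by species, with KDE contours on the
    # lower triangle and KDE curves on the diagonal
    g = sb.pairplot(df, hue='class')
    g.map_lower(sb.kdeplot)
    g.map_diag(sb.kdeplot)

.. parsed-literal::

    <seaborn.axisgrid.PairGrid at 0xc30557f470>

.. image:: output_7_1.png

.. code:: ipython3

    # Drop the non-numeric 'class' column first: recent pandas versions raise
    # an error when corr() meets string data
    df.drop(columns='class').corr()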

============  ============  ===========  ============  ===========
\             sepal_length  sepal_width  petal_length  petal_width
============  ============  ===========  ============  ===========
sepal_length  1.000000      -0.109369    0.871754      0.817954
sepal_width   -0.109369     1.000000     -0.420516     -0.356544
petal_length  0.871754      -0.420516    1.000000      0.962757
petal_width   0.817954      -0.356544    0.962757      1.000000
============  ============  ===========  ============  ===========

.. code:: ipython3

    # Correlation of the four numeric features only
    corr = df.drop(columns='class').corr()
    sb.set(font_scale=1.25)
    sb.heatmap(corr, square=True, annot=True, fmt='.2g', cmap='viridis', linewidths=1)

.. parsed-literal::

    <matplotlib.axes._subplots.AxesSubplot at 0xc3068b06a0>

.. image:: output_9_1.png

.. code:: ipython3

    sb.set(font_scale=1.4)
    sb.clustermap(corr, square=True, annot=True, fmt='.2g', cmap='viridis', linewidths=1)

.. parsed-literal::

    <seaborn.matrix.ClusterGrid at 0xc3075c4f28>

.. image:: output_10_1.png

.. code:: ipython3

    # Features: the four measurement columns
    X = df.iloc[:, 0:4]
    X.head()

=====  ============  ===========  ============  ===========
\      sepal_length  sepal_width  petal_length  petal_width
=====  ============  ===========  ============  ===========
0      5.1           3.5          1.4           0.2
1      4.9           3.0          1.4           0.2
2      4.7           3.2          1.3           0.2
3      4.6           3.1          1.5           0.2
4      5.0           3.6          1.4           0.2
=====  ============  ===========  ============  ===========

.. code:: ipython3

    # Target: the species label of each sample
    Y = df['class']
    Y.head()

.. parsed-literal::

    0    Iris-setosa
    1    Iris-setosa
    2    Iris-setosa
    3    Iris-setosa
    4    Iris-setosa
    Name: class, dtype: object

.. code:: ipython3

    from sklearn.preprocessing import LabelEncoder

.. code:: ipython3

    # Encode the three species names as the integers 0, 1 and 2
    lbe = LabelEncoder().fit(Y)

.. code:: ipython3

    Y = lbe.transform(Y)

.. code:: ipython3

    Y

.. parsed-literal::

    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

.. code:: ipython3

    kmeans = KMeans(n_clusters=3)

.. code:: ipython3

    kmeans.fit(X)

.. parsed-literal::

    KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
        n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
        random_state=None, tol=0.0001, verbose=0)

.. code:: ipython3

    y_kmeans = kmeans.predict(X)

.. code:: ipython3

    import matplotlib.pyplot as plt

.. code:: ipython3

    centers = kmeans.cluster_centers_

.. code:: ipython3

    # Plot sepal_width (column 1) against petal_length (column 2), coloured by
    # assigned cluster, with the cluster centroids overlaid in black
    plt.scatter(X.iloc[:, 1], X.iloc[:, 2], c=y_kmeans, s=100, cmap='viridis')
    plt.scatter(centers[:, 1], centers[:, 2], c='black', s=2500, alpha=0.75)

    plt.grid(False)

.. image:: output_22_0.png

.. code:: ipython3

    from sklearn.metrics import confusion_matrix

.. code:: ipython3

    confusion_matrix(Y, y_kmeans)

.. parsed-literal::

    array([[ 0, 50,  0],
           [ 2,  0, 48],
           [36,  0, 14]], dtype=int64)

.. code:: ipython3

    conf = confusion_matrix(Y, y_kmeans)
    sb.set(font_scale=1.25)
    # conf.T puts the predicted clusters on the rows (y-axis) and the true
    # classes on the columns (x-axis), so the axis labels follow that layout
    sb.heatmap(conf.T, square=True, annot=True, fmt='.2g', cmap='viridis', linewidths=1)
    plt.xlabel('True Value')
    plt.ylabel('Predicted Value')

.. parsed-literal::

    Text(78.90000000000006, 0.5, 'Predicted Value')

.. image:: output_25_1.png

.. code:: ipython3

    from sklearn.metrics import accuracy_score
    # Cluster ids are arbitrary, so raw accuracy against the encoded classes
    # says little unless the clusters are first matched to the classes
    accuracy_score(Y, y_kmeans)

.. parsed-literal::

    0.09333333333333334
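The very low accuracy above comes from the arbitrary numbering of the clusters rather than from poor clustering. The sketch below, which assumes the confusion matrix printed earlier and uses only the standard library, tries every one-to-one assignment of cluster ids to classes and reports the agreement for the best one. The variable names are illustrative only.

```python
from itertools import permutations

# Confusion matrix reported above: rows are true classes, columns are cluster ids
conf = [[0, 50, 0],
        [2, 0, 48],
        [36, 0, 14]]

n_samples = sum(sum(row) for row in conf)

# Agreement when cluster id i is naively read as class i
raw_accuracy = sum(conf[i][i] for i in range(3)) / n_samples

# Try every one-to-one assignment of cluster ids to classes and keep the best;
# perm[cls] is the cluster id matched to class cls
best_hits, best_perm = max(
    (sum(conf[cls][cluster] for cls, cluster in enumerate(perm)), perm)
    for perm in permutations(range(3))
)
matched_accuracy = best_hits / n_samples

print(raw_accuracy)
print(matched_accuracy)
print(best_perm)
```

Exhaustive search is fine for three clusters; for many clusters an assignment solver would be the more practical choice.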

Adjusted Mutual Information
~~~~~~~~~~~~~~~~~~~~~~~~~~~

This metric is symmetric: swapping label_true and label_pred yields the
same score. That makes it useful for measuring the agreement between two
independent labelings of the same dataset when the true labels are
unknown.

.. code:: ipython3

    from sklearn.metrics import adjusted_mutual_info_score
    print(adjusted_mutual_info_score(Y, y_kmeans, average_method='arithmetic'))
    print("The score approaches 1 for identical clusterings.")


.. parsed-literal::

    0.7551191675800482
    The score approaches 1 for identical clusterings.
    

Adjusted Rand index
~~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    from sklearn.metrics import adjusted_rand_score
    print(adjusted_rand_score(Y, y_kmeans))
    print("The score approaches 1 for identical clusterings.")


.. parsed-literal::

    0.7302382722834697
    The score approaches 1 for identical clusterings.
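To make the adjusted Rand index less of a black box, the following sketch recomputes it by pair counting from the 3x3 table of class and cluster counts reported in this document. It assumes Python 3.8+ for ``math.comb``; the helper ``pairs`` is a hypothetical name.

```python
from math import comb

# Counts of samples per (true class, cluster), as reported in this document
contingency = [[0, 50, 0],
               [2, 0, 48],
               [36, 0, 14]]


def pairs(counts):
    """Total number of unordered sample pairs inside the given groups."""
    return sum(comb(c, 2) for c in counts)


row_totals = [sum(row) for row in contingency]        # class sizes
col_totals = [sum(col) for col in zip(*contingency)]  # cluster sizes
n = sum(row_totals)

# Pairs that share both the class and the cluster
same_in_both = pairs(c for row in contingency for c in row)
# Value expected if the two labelings were independent
expected = pairs(row_totals) * pairs(col_totals) / comb(n, 2)
# Largest value attainable for these class and cluster sizes
maximum = (pairs(row_totals) + pairs(col_totals)) / 2

ari = (same_in_both - expected) / (maximum - expected)
print(ari)
```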
    

Calinski-Harabasz score
~~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    # calinski_harabaz_score was renamed to calinski_harabasz_score in scikit-learn
    from sklearn.metrics import calinski_harabasz_score
    print(calinski_harabasz_score(X, y_kmeans))
    print("The score is the ratio of between-cluster dispersion to within-cluster dispersion; higher is better.")


.. parsed-literal::

    560.3999242466401
    The score is the ratio of between-cluster dispersion to within-cluster dispersion; higher is better.
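The ratio behind this score can be followed on a toy example. The sketch below is a standard-library reimplementation under the usual definition (between-cluster dispersion divided by within-cluster dispersion, each scaled by its degrees of freedom). The function name ``calinski_harabasz`` and the four one-dimensional points are made up for illustration.

```python
def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion for labelled points."""
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    overall_mean = [sum(v) / n for v in zip(*points)]

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    between = 0.0
    within = 0.0
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = [sum(v) / len(members) for v in zip(*members)]
        between += len(members) * squared_distance(centroid, overall_mean)
        within += sum(squared_distance(p, centroid) for p in members)

    return (between / (k - 1)) / (within / (n - k))


# Two tight, well-separated toy clusters
toy_points = [(0.0,), (2.0,), (10.0,), (12.0,)]
toy_labels = [0, 0, 1, 1]
toy_score = calinski_harabasz(toy_points, toy_labels)
print(toy_score)
```

Moving the two toy clusters further apart raises the score, while spreading the points within each cluster lowers it.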
    

Davies-Bouldin score
~~~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    from sklearn.metrics import davies_bouldin_score
    print(davies_bouldin_score(X, y_kmeans))
    print("The score is the ratio of within-cluster distances to between-cluster distances; lower is better.")


.. parsed-literal::

    0.6623228649898758
    The score is the ratio of within-cluster distances to between-cluster distances; lower is better.
    

.. parsed-literal::

    C:\WPy64-3720\python-3.7.2.amd64\lib\site-packages\sklearn\metrics\cluster\unsupervised.py:342: RuntimeWarning: divide by zero encountered in true_divide
      score = (intra_dists[:, None] + intra_dists) / centroid_distances
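For intuition about the Davies-Bouldin score, here is a standard-library sketch on toy data, assuming the usual definition: for every cluster, take the worst ratio of summed scatter to centroid distance against any other cluster, then average. The name ``davies_bouldin`` and the data are hypothetical, and ``math.dist`` requires Python 3.8+. A cluster is never compared with itself here, since that distance is zero, which may be related to the divide-by-zero warning shown above.

```python
import math


def davies_bouldin(points, labels):
    """Average, over clusters, of the worst scatter-to-separation ratio."""
    clusters = sorted(set(labels))
    centroids = {}
    scatter = {}
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
        # Mean distance of the members to their own centroid
        scatter[c] = sum(math.dist(p, centroids[c]) for p in members) / len(members)

    worst_ratios = []
    for i in clusters:
        worst_ratios.append(max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i  # skip the zero self-distance
        ))
    return sum(worst_ratios) / len(worst_ratios)


toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
toy_labels = [0, 0, 1, 1]
toy_score = davies_bouldin(toy_points, toy_labels)
print(toy_score)
```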
    

Completeness Score
~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    from sklearn.metrics import completeness_score
    print(completeness_score(Y, y_kmeans))
    print("*Completeness* indicates whether all data points of the same class are members of the same cluster.")


.. parsed-literal::

    0.7649861514489815
    *Completeness* indicates whether all data points of the same class are members of the same cluster.
    

Contingency Matrix
~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    from sklearn.metrics.cluster import contingency_matrix
    print(contingency_matrix(Y, y_kmeans,eps=None, sparse=False))


.. parsed-literal::

    [[ 0 50  0]
     [ 2  0 48]
     [36  0 14]]
    

A matrix of the true values against the predicted values. It is similar
to a *confusion matrix*.
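How such a table is built can be shown with a few lines of plain Python. This is only a sketch with made-up labels and a hypothetical helper named ``contingency``, not the scikit-learn implementation.

```python
from collections import Counter


def contingency(true_labels, cluster_labels):
    """Count how many samples fall into each (class, cluster) combination."""
    counts = Counter(zip(true_labels, cluster_labels))
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    # One row per true class, one column per cluster id
    return [[counts[(cls, clu)] for clu in clusters] for cls in classes]


# Made-up labels for six samples
true_labels = ['a', 'a', 'a', 'b', 'b', 'b']
cluster_labels = [1, 1, 0, 0, 0, 0]

table = contingency(true_labels, cluster_labels)
print(table)
```

Unlike a confusion matrix, the table does not need the class labels and the cluster ids to share the same set of values.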

Fowlkes Mallows Score
~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    from sklearn.metrics import fowlkes_mallows_score
    print(fowlkes_mallows_score(Y, y_kmeans))


.. parsed-literal::

    0.8208080729114153
    

The score ranges from 0 to 1. The higher the value, the greater the
similarity between the true values and the predicted values.
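The Fowlkes-Mallows score can also be reproduced by counting pairs. The sketch below assumes the 3x3 table of class and cluster counts reported in this document and Python 3.8+ for ``math.comb``; the variable names are illustrative.

```python
from math import comb, sqrt

# Counts of samples per (true class, cluster), as reported in this document
contingency = [[0, 50, 0],
               [2, 0, 48],
               [36, 0, 14]]

# Pairs of samples placed together by both the classes and the clusters
together_in_both = sum(comb(c, 2) for row in contingency for c in row)
# Pairs placed together by the true classes
together_in_classes = sum(comb(sum(row), 2) for row in contingency)
# Pairs placed together by the clusters
together_in_clusters = sum(comb(sum(col), 2) for col in zip(*contingency))

# Geometric mean of pairwise precision and pairwise recall
fmi = together_in_both / sqrt(together_in_classes * together_in_clusters)
print(fmi)
```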

Homogeneity and Completeness
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    from sklearn.metrics import homogeneity_completeness_v_measure
    # Returns the tuple (homogeneity, completeness, v_measure)
    homogeneity_completeness_v_measure(Y, y_kmeans)

.. parsed-literal::

    (0.7514854021988338, 0.7649861514489815, 0.7581756800057784)

Each value lies between 0.0 and 1.0. Values close to 1.0 are better.
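The three numbers can be traced back to entropies. The sketch below assumes the 3x3 table of class and cluster counts reported in this document and the usual definitions: homogeneity is 1 - H(class | cluster) / H(class), completeness is 1 - H(cluster | class) / H(cluster), and the V-measure is their harmonic mean. The helper names are hypothetical.

```python
from math import log

# Rows are true classes, columns are clusters
contingency = [[0, 50, 0],
               [2, 0, 48],
               [36, 0, 14]]
n = sum(sum(row) for row in contingency)


def entropy(totals):
    """Entropy of a labeling, given the size of each group."""
    return -sum(t / n * log(t / n) for t in totals if t)


def conditional_entropy(table):
    """Entropy of the row variable given the column variable."""
    h = 0.0
    for column in zip(*table):
        column_total = sum(column)
        for count in column:
            if count:
                h -= count / n * log(count / column_total)
    return h


class_totals = [sum(row) for row in contingency]
cluster_totals = [sum(col) for col in zip(*contingency)]
transposed = list(zip(*contingency))  # rows are clusters, columns are classes

homogeneity = 1 - conditional_entropy(contingency) / entropy(class_totals)
completeness = 1 - conditional_entropy(transposed) / entropy(cluster_totals)
v_measure = 2 * homogeneity * completeness / (homogeneity + completeness)

print(homogeneity, completeness, v_measure)
```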

Homogeneity Score
~~~~~~~~~~~~~~~~~


.. code:: ipython3

    from sklearn.metrics import homogeneity_score
    homogeneity_score(Y, y_kmeans)




.. parsed-literal::

    0.7514854021988338



The score lies between 0.0 and 1.0. A score of 1.0 indicates perfectly
homogeneous labeling.

Mutual Info Score
~~~~~~~~~~~~~~~~~

.. code:: ipython3

    from sklearn.metrics import mutual_info_score
    mutual_info_score(Y, y_kmeans, contingency=None)

.. parsed-literal::

    0.8255910976103356

The value is computed from the contingency matrix.
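As a rough illustration of that computation, the sketch below derives the mutual information (in nats) directly from the 3x3 table of class and cluster counts reported in this document. It uses only the standard library and the variable names are illustrative.

```python
from math import log

# Rows are true classes, columns are clusters
contingency = [[0, 50, 0],
               [2, 0, 48],
               [36, 0, 14]]

row_totals = [sum(row) for row in contingency]
col_totals = [sum(col) for col in zip(*contingency)]
n = sum(row_totals)

# Sum over non-empty cells of p(i, j) * log(p(i, j) / (p(i) * p(j)))
mutual_info = sum(
    count / n * log(count * n / (row_totals[i] * col_totals[j]))
    for i, row in enumerate(contingency)
    for j, count in enumerate(row)
    if count
)
print(mutual_info)
```

Because the value is not normalised, its upper bound depends on the entropies of the two labelings, which is the motivation for the normalised and adjusted variants.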

Normalized Mutual Info Score
~~~~~~~~~~~~~~~~~~~~~~~~~~~~


.. code:: ipython3

    from sklearn.metrics import normalized_mutual_info_score
    normalized_mutual_info_score(Y, y_kmeans, average_method='arithmetic')




.. parsed-literal::

    0.7581756800057784



The score lies between 0.0 and 1.0. A score of 1.0 indicates perfectly complete labeling.

Silhouette Score
~~~~~~~~~~~~~~~~

.. code:: ipython3

    from sklearn.metrics import silhouette_score
    print(silhouette_score(X, y_kmeans))


.. parsed-literal::

    0.5525919445499756
    

The best value is 1 and the worst value is -1. Values near 0 indicate
overlapping clusters. Negative values indicate that samples have been
assigned to the wrong cluster.
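For one sample, the silhouette value compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b) as (b - a) / max(a, b). The sketch below applies that to a made-up four-point dataset. The helper ``silhouette`` is hypothetical, assumes every cluster has at least two points, and needs Python 3.8+ for ``math.dist``.

```python
import math


def silhouette(index, points, labels):
    """Silhouette value of points[index]; every cluster needs >= 2 points."""
    own = labels[index]

    def mean_distance(cluster):
        others = [p for i, (p, l) in enumerate(zip(points, labels))
                  if l == cluster and i != index]
        return sum(math.dist(points[index], p) for p in others) / len(others)

    a = mean_distance(own)                                      # cohesion
    b = min(mean_distance(c) for c in set(labels) if c != own)  # separation
    return (b - a) / max(a, b)


# Two small toy clusters, three units apart
toy_points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
toy_labels = [0, 0, 1, 1]

s0 = silhouette(0, toy_points, toy_labels)
print(s0)
```

Averaging this value over all samples gives the overall silhouette score.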

Silhouette Samples
~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    from sklearn.metrics import silhouette_samples
    print(silhouette_samples(X, y_kmeans))


.. parsed-literal::

    [0.85157298 0.817887   0.83008729 0.8065908  0.84699565 0.74628444
     0.8210796  0.85340748 0.75384818 0.82895302 0.80150542 0.83563957
     0.81325176 0.74707696 0.70091086 0.64149392 0.77354754 0.84964579
     0.70548523 0.8178354  0.78413148 0.8237893  0.79157875 0.79422255
     0.77521625 0.80130908 0.8329435  0.84096492 0.84314169 0.81915866
     0.81735915 0.79854746 0.76017812 0.71993736 0.82895302 0.83285788
     0.79335138 0.82895302 0.7698255  0.84989778 0.84788216 0.6413782
     0.78707116 0.7991425  0.74523195 0.81162359 0.81106264 0.8198735
     0.81643527 0.85237895 0.02672203 0.38118643 0.05340075 0.59294381
     0.36885321 0.59221025 0.28232583 0.26365142 0.34419223 0.57829491
     0.3733641  0.58710354 0.55107857 0.48216686 0.56268236 0.32459291
     0.55751057 0.61072967 0.46149897 0.6115753  0.32909528 0.58968904
     0.31046301 0.49424779 0.5000461  0.38548959 0.12629433 0.11798213
     0.55293611 0.50620254 0.59466094 0.56000896 0.61972579 0.26087292
     0.54077013 0.41598629 0.16655431 0.48935747 0.60716023 0.61436443
     0.59560929 0.50352722 0.62444848 0.29200997 0.62754454 0.60657448
     0.62205599 0.55780204 0.13937138 0.63064081 0.49927538 0.23225278
     0.61193633 0.36075942 0.5577792  0.54384277 0.46682151 0.55917348
     0.44076207 0.56152256 0.26062588 0.22965423 0.55509948 0.28503067
     0.02635881 0.39825264 0.42110831 0.49486598 0.48341063 0.32868889
     0.6070348  0.33355947 0.51237366 0.20297372 0.580154   0.57818326
     0.30904249 0.25226992 0.45434264 0.51608826 0.56017398 0.48442397
     0.46255248 0.13900039 0.05328614 0.55186784 0.45549975 0.3887791
     0.35124673 0.53444618 0.5702338  0.41025549 0.23225278 0.61324746
     0.5670778  0.42513648 0.10417086 0.31493016 0.35245379 0.18544229]
    

The best value is 1 and the worst value is -1. Values near 0 indicate
overlapping clusters.

v_measure Score
~~~~~~~~~~~~~~~

.. code:: ipython3

    from sklearn.metrics import v_measure_score
    print(v_measure_score(Y, y_kmeans))


.. parsed-literal::

    0.7581756800057784
    

The score lies between 0.0 and 1.0. A score of 1.0 indicates perfectly complete labeling.

.. code:: ipython3

    from sklearn.cluster import k_means
    # precompute_distances, n_jobs and algorithm='auto' were removed in later
    # scikit-learn releases, so only the still-supported arguments are passed
    k_means(X, n_clusters=3, sample_weight=None, init='k-means++', n_init=10, max_iter=300, verbose=False, tol=0.0001, random_state=None, copy_x=True, return_n_iter=True)




.. parsed-literal::

    (array([[5.006     , 3.418     , 1.464     , 0.244     ],
            [5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
            [6.85      , 3.07368421, 5.74210526, 2.07105263]]),
     array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2,
            2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2,
            2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1]),
     78.94084142614602,
     6)



The k_means function above returns an *array* of *centroids*, an *array*
of *labels*, the *inertia*, and the number of *iterations*.
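To show where those four return values come from, here is a bare-bones sketch of Lloyd's algorithm in plain Python. It is not the scikit-learn implementation: the function name ``lloyd_k_means`` is hypothetical, the first ``k`` points serve as initial centroids instead of k-means++, and the toy data is made up. ``math.dist`` requires Python 3.8+.

```python
import math


def lloyd_k_means(points, k, max_iter=300):
    """Return (centroids, labels, inertia, n_iter) for a list of points."""
    # Simplifying assumption: the first k points are the starting centroids
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for n_iter in range(1, max_iter + 1):
        # Assignment step: each point joins its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    # Inertia: sum of squared distances to the assigned centroid
    inertia = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return centroids, labels, inertia, n_iter


toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.6),
              (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels, inertia, n_iter = lloyd_k_means(toy_points, k=2)
print(centroids)
print(labels)
print(inertia, n_iter)
```

A poor starting position can leave this simple version stuck in a weak local optimum, which is one reason library implementations typically restart several times (compare the ``n_init`` argument used above).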
