Edge ML tutorial 2: AutoML on sequential data

Important: You need at least 4 GB of RAM to run this notebook. If you are using Docker, you need to set the amount of RAM allocated to Docker in the menu Preferences / Advanced

Step #1: prepare a dataset from raw data

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split      # formerly sklearn.cross_validation
import os 
import re
import glob
import csv
from lib.edgeML import *                                   # let's import the edgeML wrapper 

The following function removes HTML tags and punctuation characters from the textual data

In [2]:
def clean(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', raw_html)   # strip HTML/SGML tags
    cleantext = cleantext.replace('\n', ' ')    # drop line breaks
    cleantext = cleantext.rstrip()
    cleantext = cleantext.replace(',', '')      # remove punctuation characters
    cleantext = cleantext.replace(';', '')
    cleantext = cleantext.replace('.', '')
    cleantext = cleantext.replace('-', '')
    cleantext = re.sub(' +', ' ', cleantext)    # collapse repeated spaces
    return cleantext
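
For instance, here is what this function returns on a small markup snippet (illustrative call, not part of the original run):

In [ ]:
clean('<p>Showers continued, throughout\nthe week.</p>')
# -> ' Showers continued throughout the week'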

The following function transforms a string variable into a list of words.

In [3]:
def toListVar(input_text):
    input_text = input_text.lstrip()
    return input_text.split(' ')
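
A quick illustrative call (not from the original run):

In [ ]:
toListVar('  Standard Oil Co')
# -> ['Standard', 'Oil', 'Co']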

Now, we declare a dataframe and set the column names

In [4]:
cols = ['topic','places','people','orgs','exchanges','objet','body']
X = pd.DataFrame(columns=cols)

The input files containing the raw textual data are parsed to build the dataframe

In [5]:
pathFiles = glob.glob('./data/*.sgm')

for path in pathFiles:

    with open(path, "r") as fichier:
        html_doc = fichier.read()
    soup = BeautifulSoup(html_doc, "lxml")      # an explicit parser avoids a bs4 warning

    for p in soup.find_all('reuters'):
        currentRow = []
        currentRow.append(clean(str(p.find('topics'))))
        currentRow.append(toListVar(clean(str(p.find('places')))))
        currentRow.append(toListVar(clean(str(p.find('people')))))
        currentRow.append(toListVar(clean(str(p.find('orgs')))))
        currentRow.append(toListVar(clean(str(p.find('exchanges')))))
        currentRow.append(toListVar(clean(str(p.find('title')))))

        # the article body starts after the dateline, when one is present
        splitLine = str(p.find('text')).split("</dateline>")
        if len(splitLine) > 1:
            currentRow.append(toListVar(clean(splitLine[1])))
        else:
            currentRow.append(toListVar(clean(splitLine[0])))

        X.loc[len(X)] = currentRow

As you can see, the columns of the resulting dataframe contain lists of words.

In [6]:
X.head(3)
Out[6]:
topic places people orgs exchanges objet body
0 cocoa [elsalvador, usa, uruguay] [] [] [] [BAHIA, COCOA, REVIEW] [Showers, continued, throughout, the, week, in...
1 [usa] [] [] [] [STANDARD, OIL, &ltSRD&gt, TO, FORM, FINANCIAL... [Standard, Oil, Co, and, BP, North, America, I...
2 [usa] [] [] [] [TEXAS, COMMERCE, BANCSHARES, &ltTCB&gt, FILES... [Texas, Commerce, Bancshares, Inc's, Texas, Co...

Now we keep only the 10 most frequent topics as the target variable; all other topics are grouped into an 'other' class.

In [7]:
listNoTarget = X['topic'].value_counts()[10:].index.tolist()
listNoTarget.append('')
X.loc[X['topic'].isin(listNoTarget), 'topic'] = 'other'

Here are the counts by class value:

In [8]:
X['topic'].value_counts()
Out[8]:
other           13443
 earn            3945
 acq             2362
 crude            408
 trade            361
 moneyfx          307
 interest         285
 moneysupply      161
 ship             158
 grain wheat      148
Name: topic, dtype: int64

Step #2: training a classifier from a sequential dataset

We declare an edgeML classifier

In [9]:
classifier = edgeML(".", "sequences")  # TODO: simplify this call

The function named 'cleanVectorVariables' must be used to remove punctuation and special characters (e.g. '[', ']', '{', '}', '<', '>') from the list variables.

In [10]:
X = classifier.cleanVectorVariables(X)
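
As an illustration, here is a minimal sketch of what such a cleaning step might do, assuming each vector variable holds a list of strings. This is a hypothetical reimplementation; the actual edgeML code may differ.

In [ ]:
# Hypothetical sketch only; the real edgeML.cleanVectorVariables may differ.
import re

def clean_vector_variables(df, vector_cols):
    pattern = re.compile(r'[\[\]{}<>,]')        # special characters to strip
    for col in vector_cols:
        df[col] = df[col].apply(
            lambda words: [pattern.sub('', w) for w in words])
    return df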

Now, we split the data into train and test sets.

In [12]:
y = X['topic']
X = X.drop('topic', axis=1) 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
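
Note that the classes are imbalanced (see the counts above). If you want the split to preserve the class proportions, scikit-learn's train_test_split also accepts a stratify argument (optional variant, not part of the original run):

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)  # preserve class proportions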

Optionally, you can set the parameters of the sequential rule mining algorithm:

In [13]:
params = {
    'nb_jobs': -1,                  # number of threads used
    'duration_of_rule_mining': 10,  # duration of the rule extraction
    'nb_rules': 1000,               # maximum number of extracted rules
    'less_rules': False,            # if True, the parsimonious mode is triggered: only independent rules are selected
    'types_vector_var' : {          # by default, vector variables are processed as 'sequences'
        'places' : 'set',           # the type of each vector variable can be set via this dictionary
        'people' : 'set',
        'orgs' : 'set',
        'exchanges' : 'set',
        'objet' : 'sequence',
        'body' : 'list'
    }
}
In [14]:
classifier.set_params(**params)

Now, the training of the classifier starts, with the following steps:

  • sequential rules extraction
  • data preparation
  • variable selection
  • learning of the ensemble model
In [15]:
classifier.fit(X_train, y_train) 
edgeML -learn-from ./tmp_edgeML_sequences/data.csv -sep ',' -output ./tmp_edgeML_sequences/model.csv -licenseAgreed -nbCore max -durationOfRuleMining 10 -nbRules 1000 

The extracted rules are stored in the following dictionary:

In [16]:
classifier.dico_partition['body']['rules']['body_rule2']
Out[16]:
{'compression gain': 0.176433,
 'proba': [[0.82444,
   0.15112,
   0.007739,
   0.0,
   0.0,
   0.016701,
   0.0,
   0.0,
   0.0,
   0.0],
  [0.05874,
   0.712863,
   0.130919,
   0.016444,
   0.01676,
   0.019211,
   0.008617,
   0.007748,
   0.019764,
   0.008934]],
 'rule_body': [' cts']}
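
Every rule in this dictionary exposes the same fields ('compression gain', 'proba', 'rule_body'), so you can, for instance, list the strongest body rules by compression gain (illustrative snippet, assuming all rules share the structure shown above):

In [ ]:
rules = classifier.dico_partition['body']['rules']
top = sorted(rules.items(), key=lambda kv: kv[1]['compression gain'], reverse=True)
for name, r in top[:5]:
    print(name + "\t" + str(r['compression gain']) + "\t" + str(r['rule_body']))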

The learned classifier has already been evaluated on the Train set during training.

In [17]:
print("Accuracy :\t\t"+str(classifier.accuracy_Train))
print("average AUC :\t\t"+str(classifier.avg_AUC_Train))
print("classes :\t\t"+str(classifier.labels))
print("AUC by class :\t\t"+str(classifier.AUC_Train))
Accuracy :		0.87937
average AUC :		0.986474
classes :		[' earn', 'other', ' acq', ' interest', ' moneyfx', ' crude', ' ship', ' grain wheat', ' trade', ' moneysupply']
AUC by class :		[0.990777, 0.947991, 0.982808, 0.98911, 0.981903, 0.996025, 0.98659, 0.998496, 0.991886, 0.999158]
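
To read the per-class figures more easily, you can pair each class label with its AUC (illustrative snippet, using only the attributes shown above):

In [ ]:
for label, auc in zip(classifier.labels, classifier.AUC_Train):
    print(label + "\t" + str(auc))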

Now we evaluate the classifier on the Test set.

In [18]:
classifier.score(X_test,y_test) 
edgeML -evaluate ./tmp_edgeML_sequences/model.csv -on ./tmp_edgeML_sequences/data.csv -sep ',' -output ./tmp_edgeML_sequences/evaluationReport.csv
Out[18]:
0.434783
In [19]:
print("Accuracy (test) :\t\t"+str(classifier.accuracy_Test))
print("average AUC (test) :\t\t"+str(classifier.avg_AUC_Test))
print("classes :\t\t\t"+str(classifier.labels))
print("AUC by class (test) :\t\t"+str(classifier.AUC_Test))
Accuracy (test) :		0.434783
average AUC (test) :		0.999771
classes :			[' earn', 'other', ' acq', ' interest', ' moneyfx', ' crude', ' ship', ' grain wheat', ' trade', ' moneysupply']
AUC by class (test) :		[0.0, 0.999228, 0.0, 0.999846, 0.0, 0.999969, 0.999846, 0.0, 0.999969, 0.0]

The learned classifier can also be applied to new data in order to make predictions.

In [24]:
Prev = classifier.predict(X_test)
Prev.head(5)
edgeML -deploy ./tmp_edgeML_sequences/model.csv -on ./tmp_edgeML_sequences/data.csv -sep ',' -inplaceMode -output ./tmp_edgeML_sequences/data_and_prediction.csv
Out[24]:
most_probable_label
0 acq
1 other
2 other
3 other
4 earn
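
As a sanity check, the test accuracy can be recomputed from these predictions with scikit-learn (illustrative snippet; it assumes the predicted labels use the same spelling as the values of y_test):

In [ ]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, Prev['most_probable_label']))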
In [25]:
Prev = classifier.predict_proba(X_test)
Prev.head(5)
edgeML -deploy ./tmp_edgeML_sequences/model.csv -on ./tmp_edgeML_sequences/data.csv -sep ',' -inplaceMode -output ./tmp_edgeML_sequences/data_and_prediction.csv
Out[25]:
P( earn|X) P(other|X) P( acq|X) P( interest|X) P( moneyfx|X) P( crude|X) P( ship|X) P( grain wheat|X) P( trade|X) P( moneysupply|X)
0 0.000003 0.000463 0.999511 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003
1 0.000733 0.956311 0.032278 0.005962 0.000014 0.000507 0.000005 0.000009 0.004177 0.000004
2 0.000724 0.959933 0.012457 0.001665 0.011370 0.003343 0.007536 0.000328 0.002451 0.000193
3 0.087225 0.881648 0.002086 0.000171 0.000017 0.001802 0.000011 0.000007 0.024792 0.002241
4 0.999654 0.000320 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003
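
The most probable class for each row is simply the argmax over these probability columns, which you can check with pandas:

In [ ]:
print(Prev.idxmax(axis=1).head(5))   # column with the highest probability per row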

Finally, the learned classifier can be used to recode the data using the extracted sequential rules.

In [ ]:
newDataSet = classifier.transform(X_test)
newDataSet.head(5)