"4 weddings and a funeral" or linear regression to analyze the open data of Moscow government
Despite the many wonderful Data Science materials out there (for example, from Open Data Science), I continue to collect scraps from the feast of reason and to share with you my experience of developing machine learning and data analysis skills from scratch.
In recent articles we looked at a couple of classification tasks, producing our own data with sweat and blood; now it's time for regression. Since nothing suitable was at hand this time, I decided to scrape the very bottom of the barrel.
I remember urging readers in one of the earlier articles to look toward domestic open data. But since I'm not a girl from an ad for "buttermilk for digestion" or horsepower shampoo, I couldn't recommend anything I hadn't tried myself.
Where to start? With the open data of the government of the Russian Federation, of course; there is, after all, a whole ministry for it. My acquaintance with the federal open data portal went about the same as the illustration to this article. Not that the registry of cinemas in Novy Urengoy or the list of rental equipment at a skating rink in Tula isn't interesting; these data sets are simply not very suitable for a regression task.
If you dig around and think a little, something worthwhile can be found on the federal open data portal, it just doesn't come easily.
I decided to leave the Treasury data for later as well.
Perhaps I liked the Moscow government's open data most of all: I found a couple of candidate problems there and in the end chose the data set "Details about registration of acts of civil status in Moscow by year".
What came out of applying minimal linear regression skills can be viewed in summary form on GitHub and, of course, under the cut.

UPD: Added the "Bonus" section.
Introduction
As usual, at the beginning of the article I'll mention the skills needed to understand it.
You will need:
- To have read a tutorial or taken a simple introductory course on machine learning
- To understand a little bit of Python
- Almost zero knowledge of mathematics (that's enough)
If you are very new to data analysis and machine learning, look through the past articles of the series in order; each was written in "hot pursuit", and they will help you decide whether you should spend time on Data Science.
All of the previously published articles are listed below.
Other articles in the series
1. Learn the basics:
- "Shop of data, big and small!" (a brief overview of Data Science courses from Cognitive Class)
- "Now he's counted" or Data Science from scratch
- "An iceberg instead of an Oscar!" or how I tried to learn the basics of Data Science on Kaggle
- "The little engine that could!" or the "Machine Learning and Data Analysis" specialization through the eyes of a Data Science beginner
2. Practice first skills:
- "Without a hitch!" or machine learning (Data Science) in C# using the Accord.NET Framework
- "Use the power of machine learning, Luke!" or automatic classification of lamps according to the KCC
Well, as promised earlier, from now on articles in this series will come with a table of contents.
Contents:
Part I: "To marry is not to put on a bast shoe" — data acquisition and primary analysis
Part II: "One man is no man" — regression on a single feature
Part III: "One head is good, but two are better" — regression on several features and regularization
Part IV: "All that glitters is not gold" — adding features
Part V: "Cut a new caftan, but measure against the old!" — trend forecasting
Bonus — improving accuracy with a different approach to the months
So, let's get down to business. Our goal is to dig up a data set sufficient to demonstrate the basic techniques of linear regression and to decide what we will predict.
This time I'll be brief and won't stray from the topic: we'll examine only linear regression (you probably know that other methods exist).
Part I: "To marry is not to put on a bast shoe" — data acquisition and primary analysis
Unfortunately, the Moscow government's open data is not as vast and boundless as the budget spent on the "My Street" improvement program, but something good could nonetheless be found.
The dynamics of registration of civil status acts should suit us fine.
It's nearly a hundred records, split by month, with counts of weddings, births, paternity establishments, deaths, name changes, and so on.
For a regression problem this will do fine.
All the code is hosted on GitHub.
So let's dissect it right away.
To begin, import the libraries:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Then load the data. The pandas library can download files straight from a remote server, which in general is just great, provided of course that the portal doesn't change its page-redirection scheme.
(I hope the download link in the code won't break; if it stops working, please message me privately so I can update it.)
#download
df = pd.read_csv('https://op.mos.ru/EHDWSREST/catalog/export/get?id=230308', compression='zip', header=0, encoding='cp1251', sep=';', quotechar='"')
#look at the data
df.head(12)
Look at the data:
ID | global_id | Year | Month | StateRegistrationOfBirth | StateRegistrationOfDeath | StateRegistrationOfMarriage | StateRegistrationOfDivorce | StateRegistrationOfPaternityExamination | StateRegistrationOfAdoption | StateRegistrationOfNameChange | TotalNumber
---|---|---|---|---|---|---|---|---|---|---|---
1 | 37591658 | 2010 | January | 9206 | 10430 | 4997 | 3302 | 1241 | 95 | 491 | 29762
2 | 37591659 | 2010 | February | 9060 | 9573 | 4873 | 2937 | 1326 | 97 | 639 | 28505
3 | 37591660 | 2010 | March | 10934 | 10528 | 3642 | 4361 | 1644 | 147 | 717 | 31973
4 | 37591661 | 2010 | APR | 10140 | 9501 | 9698 | 3943 | 1530 | 128 | 642 | 35572
5 | 37591662 | 2010 | may | 9457 | 9482 | 3726 | 3554 | 1397 | 96 | 492 | 28204
6 | 62353812 | 2010 | June | 11253 | 9529 | 9148 | 3666 | 1570 | 130 | 556 | 35852
7 | 62353813 | 2010 | July | 11477 | 14340 | 12473 | 3675 | 1568 | 123 | 564 | 44220
8 | 62353814 | 2010 | August | 10302 | 15016 | 10882 | 3496 | 1512 | 134 | 578 | 41920
9 | 62353816 | 2010 | Sep | 10140 | 9573 | 10736 | 3738 | 1480 | 101 | 686 | 36454
10 | 62353817 | 2010 | Oct | 10776 | 9350 | 8862 | 3899 | 1504 | 89 | 687 | 35167
11 | 62353818 | 2010 | Nov | 10293 | 9091 | 6080 | 3923 | 1355 | 97 | 568 | 31407
12 | 62353819 | 2010 | Dec | 10600 | 9664 | 6023 | 4145 | 1556 | 124 | 681 | 32793
If we want to use the information about months, it has to be translated into a format the model understands. scikit-learn has methods of its own for this, but for reliability we'll do it by hand (it's not much work) and at the same time drop a couple of columns that are useless in our case: the IDs and a junk column.
Note: for the Month column it would be more appropriate to use one-hot encoding, but since prediction quality doesn't interest us much at this point, I'll leave it as is.
UPD: I gave in and added a corrected version in the Bonus.
#encode months with numbers
d={'January':1, 'February':2, 'March':3, 'APR':4, 'may':5, 'June':6, 'July':7,
'August':8, 'Sep':9, 'Oct':10, 'Nov':11, 'Dec':12}
df.Month=df.Month.map(d)
#delete the useless columns (IDs and a junk column)
df.drop(['ID','global_id','Unnamed: 12'],axis=1,inplace=True)
#look at the data
df.head(12)
Since I'm not sure the table renders properly for everyone, let's look at the data as pictures.

Let's build pairwise plots that make it clear which columns of our table are linearly related to each other. We won't look at all of the data at once, though, so that there's something left to add later; first let's select a subset.
A fairly simple way to select (or drop) a subset of columns in a pandas DataFrame is simply to index it with the desired columns:
columns_to_show = ['StateRegistrationOfBirth', 'StateRegistrationOfMarriage',
'StateRegistrationOfPaternityExamination', 'StateRegistrationOfDivorce','StateRegistrationOfDeath']
data=df[columns_to_show]
Well, now we can build the plots.
grid = sns.pairplot(data)

A good practice is feature scaling, so as not to compare horses with the weight of haystacks and the average monthly atmospheric pressure.
In our case all the data are in the same units (the number of registered acts), but let's see what scaling changes.
# change scale of features
scaler = MinMaxScaler()
df2=pd.DataFrame(scaler.fit_transform(df))
df2.columns=df.columns
data2=df2[columns_to_show]
grid2 = sns.pairplot(data2)

Almost nothing changes, but to be safe we'll take the scaled data anyway.
Part II: "One man is no man" — regression on a single feature
Looking at the pictures, we see that a straight line best describes the relationship between the features StateRegistrationOfBirth and StateRegistrationOfPaternityExamination, which is not very surprising (the more often paternity is established, the more often children are registered).
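By the way, this eyeball judgment is easy to back up with numbers; here is a quick sketch printing the pairwise Pearson correlations of the scaled subset data2 built above (the Birth/PaternityExamination pair should show the largest off-diagonal value):

#a numeric check of the eyeball judgment: pairwise Pearson correlations
#(data2 is the scaled subset of columns built above)
print(data2.corr())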
Let's prepare the data: pick the two columns, feature and target, then use the ready-made library functions to split the data into training and test samples (the manipulations at the end of the code are needed to bring the data to the required shape).
X = data2['StateRegistrationOfBirth'].values
y = data2['StateRegistrationOfPaternityExamination'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train=np.reshape(X_train,[X_train.shape[0],1])
y_train=np.reshape(y_train,[y_train.shape[0],1])
X_test=np.reshape(X_test,[X_test.shape[0],1])
y_test=np.reshape(y_test,[y_test.shape[0],1])
It's important to note that, despite the apparent possibility of tying the records to a chronology, for demonstration purposes we treat the data as a collection of records with no reference to time.
Let's "feed" the data to the model, look at the coefficient of our feature, and evaluate the fit of the model using R^2 (the coefficient of determination).
#train the model and look at the results
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))
Not great, but on the other hand much better than guessing at random.
Coefficients: [[ 0.78600258]]
Score: 0.611493944197
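For reference, lr.score here is just the coefficient of determination, R^2 = 1 − SS_res / SS_tot; a minimal sketch reproducing the number above by hand:

#R^2 by hand: one minus the ratio of residual to total sum of squares
y_pred = lr.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
print('R^2 by hand:', 1 - ss_res / ss_tot)  #should match lr.score(X_test, y_test)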
Let's view it on the charts; first, the training data:
plt.scatter(X_train, y_train, color='black')
plt.plot(X_train, lr.predict(X_train), color='blue', linewidth=3)
plt.xlabel('StateRegistrationOfBirth')
plt.ylabel('StateRegistrationOfPaternityExamination')
plt.title('Regression on train data')

Now the test data:
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, lr.predict(X_test), color='green', linewidth=3)
plt.xlabel('StateRegistrationOfBirth')
plt.ylabel('StateRegistrationOfPaternityExamination')
plt.title('Regression on test data')

Part III: "One head is good, but two are better" — regression on several features and regularization
To make it more interesting, let's choose a different target variable, one that is not so obviously linearly dependent on the features.
To match the title of the article, we'll take marriage registrations as the target.
All the other columns from the pairwise-plot set become the features.
#get main data
columns_to_show2=columns_to_show.copy()
columns_to_show2.remove("StateRegistrationOfMarriage")
#get data for a model
X = data2[columns_to_show2].values
y = data2['StateRegistrationOfMarriage'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
y_train=np.reshape(y_train,[y_train.shape[0],1])
y_test=np.reshape(y_test,[y_test.shape[0],1])
First, let's train a plain linear regression model.
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))
We get a result slightly worse than in the previous case (not surprisingly):
Coefficients: [[-0.03475475 0.97143632 -0.44298685 -0.18245718]]
Score: 0.38137432065
To combat overfitting and/or to select features, a regularization mechanism is typically used together with the linear regression model; in this article we'll consider Lasso (L1 regularization).
The larger the regularization coefficient alpha, the more strongly the model penalizes large weights, up to the point of driving some coefficients of the regression equation to zero.
#let's look at different alpha parameters:
#large
lasso = linear_model.Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
print(' Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))
#small
lasso = linear_model.Lasso(alpha=0.000000001)
lasso.fit(X_train, y_train)
print('\n Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))
#optimal (for this test data)
lasso = linear_model.Lasso(alpha=0.00025)
lasso.fit(X_train, y_train)
print('\n Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))
It should be noted that here I'm doing a bad thing: tuning the regularization coefficient on the test data. In real life you must not do that, but purely for demonstration we will (an honest cross-validated alternative is sketched after the output below).
Look at the output:
Alpha: 0.01
Coefficients: [ 0. 0.46642996 -0. -0. ]
Score: 0.222071102783
Alpha: 1e-09
Coefficients: [-0.03475462 0.97143616 -0.44298679 -0.18245715]
Score: 0.38137433837
Alpha: 0.00025
Coefficients: [-0.00387233 0.92989507 -0.42590052 -0.17411615]
Score: 0.385551648602
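And here is the honest approach promised above: a minimal sketch that picks alpha by cross-validation on the training data only, using scikit-learn's built-in LassoCV (the alphas grid is my own choice here, not part of the original code):

#honest alpha selection: cross-validate on the training data only
from sklearn.linear_model import LassoCV
lcv = LassoCV(alphas=np.logspace(-6, -1, 50), cv=5)
lcv.fit(X_train, y_train.ravel())  #LassoCV expects a 1-d target
print('Alpha:', lcv.alpha_)
print('Score:', lcv.score(X_test, y_test))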
In this case regularization doesn't improve the quality much; let's try adding more features.
Part IV: "All that glitters is not gold" — adding features
Let's add the seemingly useless feature "total number of registrations". Why is it useless? You'll see for yourself shortly.
columns_to_show3=columns_to_show2.copy()
columns_to_show3.append("TotalNumber")
columns_to_show3
X = df2[columns_to_show3].values
# y hasn't changed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
y_train=np.reshape(y_train,[y_train.shape[0],1])
y_test=np.reshape(y_test,[y_test.shape[0],1])
To start, let's look at the results without regularization:
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))
Coefficients: [[-0.45286477 -0.08625204 -0.19375198 -0.63079401 1.57467774]]
Score: 0.999173764473
That's crazy, almost 100% accuracy!
How could this feature be called useless?!
Well, let's think sensibly: the number of marriages is itself included in the total, so given all the other components the model can essentially recover it by subtracting them from the total, and the accuracy approaches 100%. In practice such a feature is useless.
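The leakage is easy to demonstrate directly: subtract the other four categories from TotalNumber and what is left is essentially the marriage count plus the small columns we left out (adoptions and name changes). A quick sketch on the unscaled frame df:

#TotalNumber minus the other categories is marriages plus small leftovers
others = ['StateRegistrationOfBirth', 'StateRegistrationOfDeath',
          'StateRegistrationOfDivorce', 'StateRegistrationOfPaternityExamination']
leftover = df['TotalNumber'] - df[others].sum(axis=1) - df['StateRegistrationOfMarriage']
print(leftover.describe())  #only adoptions and name changes remain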
Let's move on to our Lasso.
First we take a small regularization coefficient:
#optimal (for this test data)
lasso = linear_model.Lasso(alpha=0.00015)
lasso.fit(X_train, y_train)
print('\n Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))
Alpha: 0.00015
Coefficients: [-0.44718703 -0.07491507 -0.1944702 -0.62034146 1.55890505]
Score: 0.999266251287
Well, nothing much has changed, which isn't interesting; let's see what happens when we increase it.
#large
lasso = linear_model.Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
print('\n Alpha:', lasso.alpha)
print(' Coefficients:', lasso.coef_)
print(' Score:', lasso.score(X_test,y_test))
Alpha: 0.01
Coefficients: [-0. -0. -0. -0.05177979 0.87991931]
Score: 0.802210158982
Purely out of curiosity, let's see how well marriage registrations are explained by the total number of registrations alone.
X_train=np.reshape(X_train[:,4],[X_train.shape[0],1])
X_test=np.reshape(X_test[:,4],[X_test.shape[0],1])
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_train,y_train))
Coefficients: [ 1.0571131]
Score: 0.788270672692
Not bad, but objectively worse than with the other features.
Let's look at the charts:
# plot for train data
plt.figure(figsize=(8,10))
plt.subplot(211)
plt.scatter(X_train, y_train, color='black')
plt.plot(X_train, lr.predict(X_train), color='blue', linewidth=3)
plt.xlabel('Total Number of Registration')
plt.ylabel('State Registration Of Marriage')
plt.title('Regression on train data')
# plot for test data
plt.subplot(212)
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, lr.predict(X_test), '--', color='green', linewidth=3)
plt.xlabel('Total Number of Registration')
plt.ylabel('State Registration Of Marriage')
plt.title('Regression on test data')

Now let's take the original dataset and add another useless feature:
StateRegistrationOfNameChange. You could build a model and see for yourself what share of the data this feature explains (it's small, believe me).
So let's form the data and train the model on the old four features plus the useless one.
columns_to_show4=columns_to_show2.copy()
columns_to_show4.append("StateRegistrationOfNameChange")
X = df2[columns_to_show4].values
# y hasn't changed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
y_train=np.reshape(y_train,[y_train.shape[0],1])
y_test=np.reshape(y_test,[y_test.shape[0],1])
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))
Coefficients: [[ 0.06583714 1.1080889 -0.35025999 -0.24473705 -0.4513887 ]]
Score: 0.285094398157
We won't bother with regularization; it won't change anything radically. As you can see, this feature only spoils everything.
Let's finally choose a useful feature.
Everyone knows there is a hot season for weddings (summer and early fall) and a quiet season (winter).
By the way, I was surprised how few weddings are celebrated in May.
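The seasonality (and the surprising May dip) is easy to check by averaging the target over calendar months; a quick sketch on the frame df, where Month is already numeric:

#average number of marriage registrations per calendar month
print(df.groupby('Month')['StateRegistrationOfMarriage'].mean().round(0))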
#get data
columns_to_show5=columns_to_show2.copy()
columns_to_show5.append("Month")
#get data for model
X = df2[columns_to_show5].values
# y hasn't changed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
y_train=np.reshape(y_train,[y_train.shape[0],1])
y_test=np.reshape(y_test,[y_test.shape[0],1])
#train the model and get predictions
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Score:', lr.score(X_test,y_test))
Coefficients: [[-0.10613428 0.91315175 -0.55413198 -0.13253367 0.28536285]]
Score: 0.472057997208
A good improvement in quality and, above all, one that agrees with everyday logic.
Part V: "Cut a new caftan, but measure against the old!" — trend forecasting
The last thing we should probably look at is linear regression as a tool for forecasting a trend. In the previous parts we sampled the data randomly, i.e. the training set contained records from the entire time range. This time we split the data into past and future and see whether we can predict anything.
For convenience we'll count time in months elapsed since January 2010; for this we'll write a simple anonymous (lambda) function that converts the dates, and replace the Year column with the number of months.
We'll train on the data up to 2016, and the "future" for us will be everything from 2016 onward.
#get data
df3=df.copy()
#replace the Year column with the number of months since January 2010
df3.Year=df.Year.map(lambda x: (x-2010)*12)+df.Month
df3.rename(columns={'Year': 'Months'}, inplace=True)
#get data for the model
X=df3[columns_to_show5].values
y=df3['StateRegistrationOfMarriage'].values
#the past: everything up to month 72 (December 2015); the future: the rest
train = df3.Months.values <= 72
test = df3.Months.values > 72
X_train=X[train]
y_train=y[train]
X_test=X[test]
y_test=y[test]
y_train=np.reshape(y_train,[y_train.shape[0],1])
y_test=np.reshape(y_test,[y_test.shape[0],1])
#train the model and get predictions
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_[0])
print('Score:', lr.score(X_test,y_test))
Coefficients: [ 2.60708376e-01 1.30751121e+01 -3.31447168e+00 -2.34368684e-01 2.88096512e+02]
Score: 0.383195050367
As you can see, with this split the accuracy drops somewhat; nevertheless, the prediction quality is much better than pointing a finger at the sky.
Let's verify by looking at the charts.
plt.figure(figsize=(9,23))
# plot for train data
plt.subplot(311)
plt.scatter(df3.Months.values[train], y_train, color='black')
plt.plot(df3.Months.values[train], lr.predict(X_train), color='blue', linewidth=2)
plt.xlabel('Months (from 01.2010)')
plt.title('Regression on train data')
# plot for test data
plt.subplot(312)
plt.scatter(df3.Months.values[test], y_test, color='black')
plt.plot(df3.Months.values[test], lr.predict(X_test), color='green', linewidth=2)
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')
plt.title('Regression (prediction) on test data')
# plot for all data
plt.subplot(313)
plt.scatter(df3.Months.values[train], y_train, color='black')
plt.plot(df3.Months.values[train], lr.predict(X_train), color='blue', label='train', linewidth=2)
plt.scatter(df3.Months.values[test], y_test, color='black')
plt.plot(df3.Months.values[test], lr.predict(X_test), color='green', label='test', linewidth=2)
plt.title('Regression (prediction) on all data')
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')
#plot a line linking train to test
plt.plot([72,73], lr.predict([X_train[-1],X_test[0]]), color='magenta', linewidth=2, label='train to test')
plt.legend()

On the charts, blue represents the "past", green the "future", and magenta the line connecting them.
It's clear that our model describes the points far from perfectly, but it at least captures the seasonal pattern.
So one can hope that in the future the model's predictions will give at least a rough idea of marriage registrations.
Although trend analysis calls for far more advanced tools, which are beyond the scope of this article (and, in my opinion, of elementary Data Science skills).
Conclusion
Well, here we have examined the regression problem. I suggest you look for other relationships on the open data portals of your country's state institutions; maybe you'll find some interesting dependence. As a "challenge" I propose digging up something on the open data portal of the Republic of Belarus, opendata.by.
To close this part, a picture based on Alexander Lukashenko's exchanges with reporters, answering inconvenient questions.

Bonus — improving accuracy with a different approach to the months
Colleagues left helpful comments with recommendations for improving prediction quality.
In short, all the suggestions came down to the fact that, in trying to simplify everything, I had encoded the "Month" column incorrectly (and indeed I had). To improve on that we'll try two approaches.
Option 1 — one-hot encoding, where each month value becomes a feature of its own.
To begin, load the original data without the earlier revisions:
df_base = pd.read_csv('https://op.mos.ru/EHDWSREST/catalog/export/get?id=230308', compression='zip', header=0, encoding='cp1251', sep=';', quotechar='"')
Then apply the one-hot encoding implemented in pandas (the get_dummies function), drop the unneeded columns, and retrain the model and redraw the charts.
#get data for model
df4=df_base.copy()
df4.drop(['Year','StateRegistrationOfMarriage','ID','global_id','Unnamed: 12','TotalNumber','StateRegistrationOfNameChange','StateRegistrationOfAdoption'],axis=1,inplace=True)
df4=pd.get_dummies(df4,prefix=['Month'])
X=df4.values
X_train=X[train]
X_test=X[test]
#train the model and get predictions
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_[0])
print('Score:', lr.score(X_test,y_test))
# plot for all data
plt.scatter(df3.Months.values[train], y_train, color='black')
plt.plot(df3.Months.values[train], lr.predict(X_train), color='blue', label='train', linewidth=2)
plt.scatter(df3.Months.values[test], y_test, color='black')
plt.plot(df3.Months.values[test], lr.predict(X_test), color='green', label='test', linewidth=2)
plt.title('Regression (prediction) on all data')
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')
#plot a line linking train to test
plt.plot([72,73], lr.predict([X_train[-1],X_test[0]]), color='magenta', linewidth=2, label='train to test')
We get:
Coefficients: [ 2.18633008e-01 -1.41397731e-01 4.56991414e-02 -5.17558633e-01
4.48131002e+03 -2.94754108e+02 -1.14429758e+03 3.61201946e+03
2.41208054e+03 -3.23415050e+03 -2.73587261e+03 -1.31020899e+03
4.84757208e+02 3.37280689e+03 -2.40539320e+03 -3.23829714e+03]
Score: 0.869208071831

The quality really has grown!
Option 2 — target encoding: we encode each month with the mean value of the target for that month, computed on the training sample (thanks to roryorangepants).
#get data for the model
df5=df_base.copy()
d=dict()
#mean value of marriage registrations per month, computed on the training rows
for mon in df5.Month.unique():
    d[mon]=df5.StateRegistrationOfMarriage.values[train][df5.Month.values[train]==mon].mean()
df5['MeanMarriagePerMonth']=df5.Month.map(d)
df5.drop(['Month','Year','StateRegistrationOfMarriage','ID','global_id','Unnamed: 12','TotalNumber',
          'StateRegistrationOfNameChange','StateRegistrationOfAdoption'],axis=1,inplace=True)
#get data for model
X=df5.values
X_train=X[train]
X_test=X[test]
#train the model and get predictions
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_[0])
print('Score:', lr.score(X_test,y_test))
# plot for all data
plt.scatter(df3.Months.values[train], y_train, color='black')
plt.plot(df3.Months.values[train], lr.predict(X_train), color='blue', label='train', linewidth=2)
plt.scatter(df3.Months.values[test], y_test, color='black')
plt.plot(df3.Months.values[test], lr.predict(X_test), color='green', label='test', linewidth=2)
plt.title('Regression (prediction) on all data')
plt.xlabel('Months (from 01.2010)')
plt.ylabel('State Registration Of Marriage')
#plot a line linking train to test
plt.plot([72,73], lr.predict([X_train[-1],X_test[0]]), color='magenta', linewidth=2, label='train to test')
We get:
Coefficients: [ 0.16556761 -0.12746446 -0.03652408 -0.21649349 0.96971467]
Score: 0.875882918435

The result is very similar in quality, but with a significantly smaller number of features.
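By the way, the per-month means can be collected more idiomatically with groupby; an equivalent sketch, assuming the same df_base and the boolean train mask from Part V:

#equivalent target encoding via groupby on the training rows only
monthly_means = df_base[train].groupby('Month')['StateRegistrationOfMarriage'].mean()
df_base['MeanMarriagePerMonth'] = df_base['Month'].map(monthly_means)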
Well, that's all for now, I hope.
Here's a farewell picture with Mr. "Copernicus"; I hope it offends no one and doesn't spark any holy wars :)
