05c My first ML model — data cleaning

Digital Nomad PhD
3 min read · Feb 22, 2021

First I am going to investigate the Last Traded Volume feature. After some re-plotting I find that there is no data from before mid-2016. I have a decision to make: I can either remove the feature entirely (many ML models cannot handle missing data) or I can remove all the data from before mid-2016 across all the features. I decide I would like to keep the Last Traded Volume feature, as it sounds like it could be an important one, so I remove all the data prior to 2017 from my dataframe.

# investigate the lastTradedVolume data
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(16, 6))
sns.lineplot(x='snapshotTime', y='lastTradedVolume', data=prices_df[3000:])

# remove all rows from before the end of 2016
after_start_date = prices_df["snapshotTime"] >= '2016-12-30'
prices_df = prices_df.loc[after_start_date]
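As a cross-check on the plot (run before the filter above), a quick way to confirm where the lastTradedVolume data actually starts is a sketch like this, assuming zeros and NaNs both mean no data:

# sketch: first timestamp with a non-zero traded volume (run before the filter above)
has_volume = prices_df['lastTradedVolume'].fillna(0) > 0
print(prices_df.loc[has_volume, 'snapshotTime'].min())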

Next I investigate the Last Traded price features. I find that these features are not well populated with data, so I drop them.

prices_df = prices_df.drop(['openPrice.lastTraded', 'highPrice.lastTraded', 'closePrice.lastTraded', 'lowPrice.lastTraded'], axis=1)
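To put a number on "not well populated", a check like the sketch below (run before the drop above, reusing those column names) shows the share of non-missing values in each column:

# sketch: fraction of non-missing values in the lastTraded columns (run before the drop)
last_traded_cols = ['openPrice.lastTraded', 'highPrice.lastTraded',
                    'closePrice.lastTraded', 'lowPrice.lastTraded']
print(prices_df[last_traded_cols].notna().mean())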

This is what my data looks like now.

The Last Traded Volume still looks funky, dropping down to zero. After a bit of investigation I find that on Sundays (day 6) the value is always 0. I decide to replace the zeros with the previous non-zero value.

# add a day-of-week feature
prices_df = prices_df.set_index('snapshotTime')
prices_df['days_of_week'] = prices_df.index.dayofweek
sns.histplot(data=prices_df, x='days_of_week', bins=7)

# looking at the data by day of week shows that day 6 (Sunday) is full of zeros
plt.clf()
sns.lineplot(x='snapshotTimeUTC', y='lastTradedVolume',
             data=prices_df.loc[prices_df['days_of_week'] == 6])

# data cleaning - mask the zeros and forward fill with the previous non-zero value
prices_df['lastTradedVolume'] = (prices_df['lastTradedVolume']
                                 .mask(prices_df['lastTradedVolume'] == 0)
                                 .ffill(downcast='infer'))

# re-plot to check the result
plt.clf()
sns.lineplot(x='snapshotTimeUTC', y='lastTradedVolume', data=prices_df)
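As a quick sanity check that the forward fill caught every zero, something like this sketch should now print 0:

# sketch: count the remaining zero-volume rows, expected to be 0 after the forward fill
print((prices_df['lastTradedVolume'] == 0).sum())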

This looks better. Moving on.

In machine learning, more is not always better. It is common practice to remove features that are highly correlated. The next step is to find the highly correlated features.

corr = prices_df.corr()

# plot a heatmap of the correlation matrix
sns.heatmap(corr, annot=True)
plt.yticks(rotation=0); plt.xticks(rotation=90)  # fix tick label directions
plt.show()
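Reading pairs off the heatmap is fine for a handful of columns; with more features, a sketch like this (using the corr matrix from above and an illustrative 0.95 threshold) lists the most correlated pairs directly:

import numpy as np

# sketch: list feature pairs with |correlation| above a threshold (threshold is illustrative)
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # keep each pair once
pairs = corr.where(upper).stack()                       # (feature_a, feature_b) -> correlation
print(pairs[pairs.abs() > 0.95].sort_values(ascending=False))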

I find that the bid, ask, open, close, high and low prices are all perfectly correlated, so I do not need them all. Dropping…

# bid, ask, open, close, high and low are all perfectly correlated - do not need them all
prices_df = prices_df.drop(['snapshotTimeUTC', 'openPrice.ask', 'closePrice.bid', 'highPrice.bid', 'highPrice.ask', 'lowPrice.bid', 'lowPrice.ask'], axis=1)

# print the remaining column names
print(prices_df.head(0))

That's it! Data cleaning done.
