Time series forecasting: When to invest in Bitcoin

Goals:

Our task in this project is to build an RNN model that predicts the Bitcoin closing price for the following hour, given data from the previous 24 hours.

Here are some questions to consider:

  • Are all of the data points useful?
  • Are all of the data features useful?
  • Should you rescale the data?
  • Is the current time window relevant?
  • How should you save this preprocessed data?

Prerequisites:

  • pandas
  • NumPy
  • TensorFlow (Keras)
  • matplotlib
  • seaborn

Reference: Time series forecasting

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf

EDA

We start by exploring our data. In this project we have two CSV files, one from Bitstamp and the other from Coinbase. First, we load the Coinbase CSV using pandas.

df = pd.read_csv('./data/coinbase.csv')

Features

Each row in this dataframe represents the Bitcoin price during a given minute:

  • Timestamp: the Unix time of each observation
  • Open: opening price for the minute
  • High: the highest price during the minute
  • Low: the lowest price during the minute
  • Close: closing price for the minute
  • Volume_(BTC): the amount of BTC transacted during the minute
  • Volume_(Currency): the USD value of the BTC transacted during the minute
  • Weighted_Price: the volume-weighted average price for the minute

df
          Timestamp     Open     High      Low    Close  Volume_(BTC)  Volume_(Currency)  Weighted_Price
0        1417411980   300.00   300.00   300.00   300.00      0.010000           3.000000      300.000000
1        1417412040      NaN      NaN      NaN      NaN           NaN                NaN             NaN
2        1417412100      NaN      NaN      NaN      NaN           NaN                NaN             NaN
3        1417412160      NaN      NaN      NaN      NaN           NaN                NaN             NaN
4        1417412220      NaN      NaN      NaN      NaN           NaN                NaN             NaN
...             ...      ...      ...      ...      ...           ...                ...             ...
2099755  1546898520  4006.01  4006.57  4006.00  4006.01      3.382954       13553.433078     4006.390309
2099756  1546898580  4006.01  4006.57  4006.00  4006.01      0.902164        3614.083168     4006.017232
2099757  1546898640  4006.01  4006.01  4006.00  4006.01      1.192123        4775.647308     4006.003635
2099758  1546898700  4006.01  4006.01  4005.50  4005.50      2.699700       10814.241898     4005.719991
2099759  1546898760  4005.51  4006.01  4005.51  4005.99      1.752778        7021.183546     4005.745614

2099760 rows × 8 columns

df.isnull().sum()
Timestamp                 0
Open                 109069
High                 109069
Low                  109069
Close                109069
Volume_(BTC)         109069
Volume_(Currency)    109069
Weighted_Price       109069
dtype: int64

Missing values

We notice that we have null values in some rows. We use the fillna method from pandas with the bfill strategy (backward fill: each missing value is filled with the next valid observation). I used it because our missing data sits at the start of the series, where there is no previous value to carry forward.

dfc=df.fillna(method="bfill")
dfc.isnull().sum()
Timestamp            0
Open                 0
High                 0
Low                  0
Close                0
Volume_(BTC)         0
Volume_(Currency)    0
Weighted_Price       0
dtype: int64
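As a minimal illustration of what bfill does (each NaN takes the value of the next non-null entry), consider a toy Series:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 3.0, np.nan, 5.0])
# backward fill: NaNs are filled from the next valid observation
print(s.fillna(method="bfill").tolist())  # [3.0, 3.0, 3.0, 5.0, 5.0]
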

Feature Selection

We have five columns that describe the price (Open, High, Low, Close and Weighted_Price) and two columns for the traded volume, but do we need that many? As the correlation matrix below shows, the price features are almost perfectly correlated, which indicates that keeping only the closing price is enough.

dfcd = dfc.drop('Timestamp', axis=1)
dfcd.corr()
                       Open      High       Low     Close  Volume_(BTC)  Volume_(Currency)  Weighted_Price
Open               1.000000  0.999998  0.999997  0.999997      0.155421           0.393050        0.999999
High               0.999998  1.000000  0.999995  0.999998      0.156012           0.393874        0.999999
Low                0.999997  0.999995  1.000000  0.999998      0.154614           0.391870        0.999999
Close              0.999997  0.999998  0.999998  1.000000      0.155300           0.392851        0.999999
Volume_(BTC)       0.155421  0.156012  0.154614  0.155300      1.000000           0.709897        0.155303
Volume_(Currency)  0.393050  0.393874  0.391870  0.392851      0.709897           1.000000        0.392863
Weighted_Price     0.999999  0.999999  0.999999  0.999999      0.155303           0.392863        1.000000
plt.rcParams["figure.figsize"] = [7.50, 3.50]
plt.rcParams["figure.autolayout"] = True

As we can see below, all the price features correlate visually as well.

figure, axis = plt.subplots(2, 2)
axis[0, 0].plot(dfc['Timestamp'], dfc['Open'])
axis[0, 0].set_title('Open')
axis[0, 1].plot(dfc['Timestamp'], dfc['Weighted_Price'])
axis[0, 1].set_title('Weighted Price')
axis[1, 0].plot(dfc['Timestamp'], dfc['High'])
axis[1, 0].set_title('High')
axis[1, 1].plot(dfc['Timestamp'], dfc['Low'])
axis[1, 1].set_title('Low')
plt.show()

[Figure: 2×2 subplots of Open, Weighted Price, High and Low vs. Timestamp]

plt.plot(dfc['Timestamp'], dfc['Close'])

[Figure: Close price vs. Timestamp]

Normalization

The data ranges from very low to very high values, especially in the case of the traded volume, so we standardize it: subtract the mean and divide by the standard deviation.

plt.scatter(dfc['Timestamp'], dfc['Volume_(BTC)'])

[Figure: scatter of raw Volume_(BTC) vs. Timestamp]

plt.scatter(dfc['Timestamp'], (dfc['Volume_(BTC)'] - dfc['Volume_(BTC)'].mean()) / dfc['Volume_(BTC)'].std())

[Figure: scatter of standardized Volume_(BTC) vs. Timestamp]

dfc.hist()

[Figure: histograms of all dfc columns]

Here we import the Bitstamp data and find that it covers the same price range with many more rows, so we will apply the same techniques and use this dataset instead.

dfb = pd.read_csv('data/bitstampUSD_1-min_data_2012-01-01_to_2020-04-22.csv')
dfb = dfb.fillna(method="bfill")
dfb
          Timestamp     Open     High      Low    Close  Volume_(BTC)  Volume_(Currency)  Weighted_Price
0        1325317920     4.39     4.39     4.39     4.39      0.455581           2.000000        4.390000
1        1325317980     4.39     4.39     4.39     4.39     48.000000         210.720000        4.390000
2        1325318040     4.39     4.39     4.39     4.39     48.000000         210.720000        4.390000
3        1325318100     4.39     4.39     4.39     4.39     48.000000         210.720000        4.390000
4        1325318160     4.39     4.39     4.39     4.39     48.000000         210.720000        4.390000
...             ...      ...      ...      ...      ...           ...                ...             ...
4363452  1587513360  6847.97  6856.35  6847.97  6856.35      0.125174         858.128697     6855.498790
4363453  1587513420  6850.23  6856.13  6850.23  6850.89      1.224777        8396.781459     6855.763449
4363454  1587513480  6846.50  6857.45  6846.02  6857.45      7.089168       48533.089069     6846.090966
4363455  1587513540  6854.18  6854.98  6854.18  6854.98      0.012231          83.831604     6854.195090
4363456  1587513600  6850.60  6850.60  6850.60  6850.60      0.014436          98.896906     6850.600000

4363457 rows × 8 columns

Selecting the appropriate time range

Looking at the full time range of the data, the early years are not very relevant (BTC at the time wasn't as mainstream as it is now), so I decided to drop them and keep only the window where prices are in a relevant range, i.e. timestamps from roughly mid-2017 onwards.

plt.plot(dfb['Timestamp'], dfb['Close'])

[Figure: Bitstamp Close price over the full time range]

sdfb = dfb[dfb['Timestamp'] >= 1.50*1e9]
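For reference, the cutoff of 1.50 × 10⁹ seconds corresponds to mid-July 2017, which can be checked with pandas:

pd.to_datetime(1.50 * 1e9, unit='s')  # Timestamp('2017-07-14 02:40:00')
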

Normalizing the new data

def norm(df):
    return (df-df.mean())/df.std()
plt.plot(sdfb['Timestamp'], sdfb['Close'])

[Figure: Close price after the time cutoff]

plt.plot(sdfb['Timestamp'], norm(sdfb['Close']))

[Figure: normalized Close price after the time cutoff]

plt.scatter(sdfb['Timestamp'], sdfb['Volume_(BTC)'])

[Figure: scatter of raw Volume_(BTC) after the cutoff]

plt.scatter(sdfb['Timestamp'], norm(sdfb['Volume_(BTC)']))

[Figure: scatter of normalized Volume_(BTC) after the cutoff]

plt.scatter(sdfb['Timestamp'], sdfb['Volume_(Currency)'])

[Figure: scatter of raw Volume_(Currency) after the cutoff]

plt.scatter(sdfb['Timestamp'], norm(sdfb['Volume_(Currency)']))

[Figure: scatter of normalized Volume_(Currency) after the cutoff]

Applying feature selection on the new data

sdfb.corr()
                   Timestamp      Open      High       Low     Close  Volume_(BTC)  Volume_(Currency)  Weighted_Price
Timestamp           1.000000  0.106416  0.105971  0.107077  0.106404     -0.094584          -0.082017        0.106568
Open                0.106416  1.000000  0.999994  0.999994  0.999991      0.021313           0.202671        0.999996
High                0.105971  0.999994  1.000000  0.999989  0.999994      0.022329           0.203772        0.999996
Low                 0.107077  0.999994  0.999989  1.000000  0.999994      0.020111           0.201358        0.999997
Close               0.106404  0.999991  0.999994  0.999994  1.000000      0.021202           0.202541        0.999996
Volume_(BTC)       -0.094584  0.021313  0.022329  0.020111  0.021202      1.000000           0.914575        0.021153
Volume_(Currency)  -0.082017  0.202671  0.203772  0.201358  0.202541      0.914575           1.000000        0.202488
Weighted_Price      0.106568  0.999996  0.999996  0.999997  0.999996      0.021153           0.202488        1.000000
nsdfb = sdfb.drop(['Open', 'High', 'Low', 'Weighted_Price', 'Timestamp'], axis=1)
nsdfb.corr()
                      Close  Volume_(BTC)  Volume_(Currency)
Close              1.000000      0.021202           0.202541
Volume_(BTC)       0.021202      1.000000           0.914575
Volume_(Currency)  0.202541      0.914575           1.000000
df = nsdfb.drop(['Volume_(Currency)'], axis=1)
df.corr()
                 Close  Volume_(BTC)
Close         1.000000      0.021202
Volume_(BTC)  0.021202      1.000000

df
           Close  Volume_(BTC)
2904896  2315.97      1.569825
2904897  2315.94      3.100000
2904898  2315.97      1.592002
2904899  2315.99      2.091700
2904900  2315.97      0.582457
...          ...           ...
4363452  6856.35      0.125174
4363453  6850.89      1.224777
4363454  6857.45      7.089168
4363455  6854.98      0.012231
4363456  6850.60      0.014436

1458561 rows × 2 columns

Data transformation

Since every row in our data represents one minute and we want our model to work with 24 hours of history, we group the data into blocks of 60 rows and average each block, turning the minute-level series into an hourly one and simplifying the model.

df = df.groupby(np.arange(len(df))//60).mean()
df.hist()

[Figure: histograms of hourly-averaged Close and Volume_(BTC)]
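
The grouping key np.arange(len(df)) // 60 gives every consecutive block of 60 rows the same label, so the mean is computed per hour. A minimal sketch of the idea (using a block size of 3 for readability):

import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [10, 20, 30, 40, 50, 60]})
print(np.arange(len(toy)) // 3)                       # [0 0 0 1 1 1]
print(toy.groupby(np.arange(len(toy)) // 3).mean())   # block means: 20.0 and 50.0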

Splitting

We split our data chronologically into training (70%), validation (20%) and test (10%) sets so the model can be trained, validated and finally evaluated on unseen data. The normalization statistics are computed from the training set only.

column_indices = {name: i for i, name in enumerate(df.columns)}
 
n = len(df)
train_df = df[0:int(n*0.7)]
val_df = df[int(n*0.7):int(n*0.9)]
test_df = df[int(n*0.9):]
 
num_features = df.shape[1]
train_mean = train_df.mean()
train_std = train_df.std()
 
train_df = (train_df - train_mean) / train_std
val_df = (val_df - train_mean) / train_std
test_df = (test_df - train_mean) / train_std
df_std = (df - train_mean) / train_std
df_std = df_std.melt(var_name='Column', value_name='Normalized')
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Column', y='Normalized', data=df_std)
_ = ax.set_xticklabels(df.keys(), rotation=90)

[Figure: violin plot of the normalized feature distributions]

Making the dataset

TensorFlow offers a very convenient API for datasets: the Keras helper tf.keras.utils.timeseries_dataset_from_array builds a tf.data.Dataset of batches with shape (batch_size, sequence_length, features) specifically for time series data. We then map a windowing function over it that splits each 25-step sequence into a tuple whose first element is the past 24 hours (the inputs) and whose second element is the closing price of the next hour (the label).

def split_window(batch):
    # inputs: the first 24 hourly steps, all features
    inputs = batch[:, :24, :]
    # label: the Close price (column 0) at the 25th step
    labels = batch[:, 24, 0]
    return inputs, labels
def make_dataset(data):
    data = np.array(data, dtype=np.float32)
    ds = tf.keras.utils.timeseries_dataset_from_array(
      data=data,
      targets=None,
      sequence_length=25,
      sequence_stride=1,
      shuffle=False,
      batch_size=32)
 
    ds = ds.map(split_window)
 
    return ds
train_dataset = make_dataset(train_df)
val_dataset = make_dataset(val_df)
test_dataset = make_dataset(test_df)
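A quick way to confirm the element shapes: with the two features (Close, Volume_(BTC)) and the settings above, each full batch should yield inputs of shape (32, 24, 2) and labels of shape (32,).

for inputs, labels in train_dataset.take(1):
    print(inputs.shape, labels.shape)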
train_dataset.save('./datasets/train')
val_dataset.save('./datasets/val')
test_dataset.save('./datasets/test')
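
The saved datasets can later be restored without redoing the preprocessing; with the same tf.data API (TensorFlow ≥ 2.10, the version implied by Dataset.save above), loading would look like:

train_dataset = tf.data.Dataset.load('./datasets/train')
val_dataset = tf.data.Dataset.load('./datasets/val')
test_dataset = tf.data.Dataset.load('./datasets/test')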

Modeling, Training and Evaluation

For modeling we use TensorFlow Keras. The tf.data.Dataset objects built above already yield (inputs, label) tuples, where the inputs span 24 time steps (24 hours) and the label is the closing price of the following hour, so the model can consume them directly.

lstm_model = tf.keras.models.Sequential([
    # Shape [batch, time, features] => [batch, lstm_units]
    # (return_sequences=False keeps only the last time step)
    tf.keras.layers.LSTM(32, return_sequences=False),
    # Shape [batch, lstm_units] => [batch, 1]
    tf.keras.layers.Dense(units=1)
])
def compile_and_fit(model, train, val, epochs=20, patience=2):
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                    patience=patience,
                                                    mode='min')
 
    model.compile(loss=tf.keras.losses.MeanSquaredError(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=[tf.keras.metrics.MeanAbsoluteError()])
 
    model.fit(train, epochs=epochs,
                      validation_data=val,
                      callbacks=[early_stopping])

As we can see, early stopping halted the training once the validation loss stopped improving.

compile_and_fit(lstm_model, train_dataset, val_dataset)
Epoch 17/20
532/532 [==============================] - 6s 11ms/step - loss: 0.0018 - mean_absolute_error: 0.0221 - val_loss: 0.0013 - val_mean_absolute_error: 0.0193

Evaluation

We will visually evaluate the model by plotting some examples and their predictions.
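
In addition to the plots, a quantitative check on the held-out test set can be run with Keras (this was not part of the original run, so no numbers are reported here):

test_loss, test_mae = lstm_model.evaluate(test_dataset)
print(f"Test MSE: {test_loss:.4f} - Test MAE: {test_mae:.4f}")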

example_window = tf.stack([np.array(test_df[:25]),
                           np.array(test_df[100:100+25]),
                           np.array(test_df[200:200+25])])
example, _ = split_window(example_window)
predictions = lstm_model.predict(example)
1/1 [==============================] - 0s 414ms/step
predictions
array([[0.45651057],
       [0.64009064],
       [0.6150275 ]], dtype=float32)
def unormalize_res(x):
    # invert the z-score normalization for the Close column (index 0)
    return x * train_std.iloc[0] + train_mean.iloc[0]
pred = unormalize_res(predictions[0])
exp = unormalize_res(np.array(test_df[:25])[:, 0])
plt.scatter(list(range(25)), exp)
plt.scatter([24], pred)

[Figure: actual 24-hour window and predicted next-hour Close (example 1)]

pred = unormalize_res(predictions[1])
exp = unormalize_res(np.array(test_df[100:100+25])[:, 0])
plt.scatter(list(range(25)), exp)
plt.scatter([24], pred)

[Figure: actual 24-hour window and predicted next-hour Close (example 2)]

pred = unormalize_res(predictions[2])
exp = unormalize_res(np.array(test_df[200:200+25])[:, 0])
plt.scatter(list(range(25)), exp)
plt.scatter([24], pred)

[Figure: actual 24-hour window and predicted next-hour Close (example 3)]

Conclusion

As we know, Bitcoin predictability is low because the price is strongly linked to external factors such as social media, inflation and overall news. Still, our sampled examples show overall good results. To improve further, we could use a smaller time frame, acquire more data, or add more features such as Bitcoin sentiment and inflation indicators.