Time Series Analysis using the fbprophet library¶

The goal of this project was to predict the next 60 days of traffic on various wiki pages given a few years' worth of traffic data. I decided to use Facebook's Prophet library since I'd heard good things about it. The biggest challenge was the sheer amount of data: the dataset contained 148,000 wiki pages, each with a few years' worth of daily traffic.

Final Approach:¶

  • Set up Python multiprocessing on an AWS compute instance to train and predict on as many pages as possible in as little time as possible.
  • Clean the data using pandas ffill (forward fill).
  • To account for outliers, replace values that fall more than std_mult_size standard deviations from the median with a rolling median over a window_size-day window.
  • Tune the parameters "std_mult_size" and "window_size" per site using the SMAPE score on validation data.
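The outlier-replacement step in the list above can be sketched in isolation. The `replace_outliers` helper and its toy series are my own illustration, not part of the competition code:

```python
import numpy as np
import pandas as pd

def replace_outliers(visits: pd.Series, std_mult: float, window: int) -> pd.Series:
    """Replace points far from the series median with a rolling median."""
    rolling_med = visits.rolling(window=window, min_periods=1).median()
    is_outlier = (visits - visits.median()).abs() >= std_mult * visits.std()
    return visits.mask(is_outlier, rolling_med)

# Toy series: one traffic spike of 1000 on an otherwise ~100-visit page
series = pd.Series([100, 102, 98, 1000, 101, 99])
cleaned = replace_outliers(series, std_mult=1.5, window=3)
print(cleaned.tolist())  # the spike is replaced by the local rolling median
```

Note that the rolling median is computed over the original, still-spiky series; only the points flagged as outliers are overwritten.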

Code¶

First things first, I loaded the data and split it into a training set and a validation set; for validation I held out the final 60 days of data:

In [ ]:
import numpy as np
import pandas as pd
import multiprocessing
from fbprophet import Prophet

N = 60  # forecast horizon in days
print('Reading data...')
train_1 = pd.read_csv('input/train_1.csv')
# Transpose so rows are dates, then hold out the final N days for validation
train, test = train_1.T.iloc[0:-N, :], train_1.T.iloc[-N:, :]

A lot of data was missing from this set, so I tested various cleaning methods. I ended up using pandas' forward fill to populate the missing entries.

In [ ]:
test_cleaned = test.T.ffill().T
train_cleaned = train.T.iloc[:, 1:].ffill().T
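Forward fill simply propagates the last valid observation onward; a quick toy illustration of the behavior:

```python
import pandas as pd

# ffill replaces each NaN with the most recent preceding value
s = pd.Series([10.0, None, None, 25.0, None])
filled = s.ffill()
print(filled.tolist())  # [10.0, 10.0, 10.0, 25.0, 25.0]
```

A leading NaN has no prior value to copy from, so it stays NaN.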

With the data ready, I kicked off a simple multiprocessing loop.

In [ ]:
print("Running...")
pool = multiprocessing.Pool(24)
result_list = pool.map(TuneValidationParams, range(148000))
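The pool fans independent, picklable tasks out across worker processes and collects the results in input order. Here is a minimal self-contained example of the same pattern (the mapped function and pool size are illustrative, not from the project):

```python
import math
import multiprocessing

# Each element of the iterable becomes one task; map preserves input order
with multiprocessing.Pool(4) as pool:
    results = pool.map(math.sqrt, [0, 1, 4, 9, 16])
print(results)  # [0.0, 1.0, 2.0, 3.0, 4.0]
```

With 148,000 pages, passing a `chunksize` argument to `pool.map` batches tasks and cuts inter-process overhead.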

The "TuneValidationParams" method calls the forecasting method after setting the best window size and std multiple. I manually limited these variables to a few candidate values to cut down on training time. In this method, I identify any outliers in the dataset that fall significantly outside the standard deviation and replace them with the rolling median value. I then run a Prophet prediction 60 days (N) into the future. Once I've found the std_mult_size and window_size with the lowest SMAPE, I send them to the RunForecast method.

In [ ]:
def TuneValidationParams(i):
    std_mult_size = [1.3, 1.4, 1.5, 1.6]
    window_size = [40, 50, 60]
    min_std = 0
    min_win = 0
    min_score = 200  # SMAPE is bounded above by 200
    continueTuning = True
    for std_mult in std_mult_size:
        for window_ in window_size:
            data = train_cleaned.iloc[:, i].to_frame()
            data.columns = ['visits']
            # Rolling median of the series, used as the replacement value for outliers
            data['median'] = data['visits'].rolling(min_periods=1, window=window_, center=False).median()
            # Replace points more than std_mult standard deviations from the median
            outliers = np.abs(data.visits - data.visits.median()) >= (std_mult * data.visits.std())
            data.loc[outliers, 'visits'] = data.loc[outliers, 'median']
            data.index = pd.to_datetime(data.index)
            # Prophet expects a two-column frame: ds (dates) and y (values)
            X = pd.DataFrame(index=range(0, len(data)))
            X['ds'] = data.index
            X['y'] = data['visits'].values
            m = Prophet(yearly_seasonality=True, uncertainty_samples=0)
            m.fit(X)
            future = m.make_future_dataframe(periods=N)
            forecast = m.predict(future)
            y_truth = test_cleaned.iloc[:, i].values
            y_forecasted = forecast['yhat'].iloc[-N:].values
            page_name = names[i][0].split("_")[0]  # names is loaded globally elsewhere
            score = smape(y_truth, y_forecasted, page_name)
            if score < min_score:
                min_score = score
                min_std = std_mult
                min_win = window_

            del m, future, forecast, data, X, y_truth, y_forecasted
            # A score under 30 is good enough; stop tuning this page early
            if min_score < 30:
                continueTuning = False
                break
        if not continueTuning:
            break

    submission = RunForecast(i, min_std, min_win)
    return submission

The RunForecast method is essentially a subset of the function above: it runs a single prediction with the chosen std_mult_size and window_size values, so I won't include it here.

Below is the SMAPE method for computing the Symmetric mean absolute percentage error, mathematically represented as:

$$\mathrm{SMAPE} = \frac{200}{n}\sum_{t=1}^{n}\frac{\left|F_t - A_t\right|}{\left|A_t\right| + \left|F_t\right|}$$

SMAPE takes the absolute difference between forecast and actual values, divided by the sum of their absolute values; a perfect forecast scores 0 and the worst possible score is 200. Note that this implementation takes the median of the per-day terms rather than the mean:

In [ ]:
def smape(y_true, y_pred, page_name=None):
    denominator = (np.abs(y_true) + np.abs(y_pred))
    diff = np.abs(y_true - y_pred) / denominator
    diff[denominator == 0] = 0.0  # treat 0/0 terms as a perfect match
    return 200 * np.median(diff)
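As a sanity check, here is a standalone version of the same median-based score (with the zero denominator guarded before dividing, to avoid the runtime warning the in-class version can emit) applied to a one-point example:

```python
import numpy as np

def smape(y_true, y_pred):
    """Median-based SMAPE: 0 is a perfect forecast, 200 is the worst case."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denominator = np.abs(y_true) + np.abs(y_pred)
    diff = np.abs(y_true - y_pred) / np.where(denominator == 0, 1, denominator)
    diff[denominator == 0] = 0.0  # define 0/0 terms as a perfect match
    return 200 * np.median(diff)

# Forecasting 150 against a true value of 100: 200 * 50/250 = 40
print(smape([100.0], [150.0]))
```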