Sales prediction for a drugstore chain.
This repository contains the code for predicting sales for Rossmann drugstores.
The data used is available on Kaggle. All additional information below is fictional.
The objectives of this project are:
- Perform exploratory data analysis on the sales data available in the dataset.
- Predict the sales for the next 6 weeks from each store of the pharmacy chain.
- Develop a Telegram bot that can be accessed by the CEO from a mobile device or computer.
Rossmann is a pharmacy chain that operates over 3,000 stores in 7 European countries. The stores are going to be renovated and the CFO needs to know how much can be invested in each one of them.
The Data Scientist was asked to develop a sales prediction model that forecasts the sales for the next 6 weeks for each store. The Telegram bot must therefore return this sales prediction for a given store.
The model predicts a gross income of $286.69 MM over the next 6 weeks for the stores available, with best-case and worst-case scenarios of $313.65 MM and $259.73 MM, respectively. These scenarios were calculated from the mean absolute percentage error (MAPE) of each store.
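One plausible reading of the scenario calculation, shown as a minimal sketch: each store's prediction is scaled down and up by that store's MAPE, and the results are summed across stores. The function name, the store IDs, and every number below are hypothetical and are not taken from the project's results.

```python
def sales_scenarios(prediction, mape):
    """Return (worst, best) for one store, assuming prediction * (1 -/+ MAPE)."""
    return prediction * (1 - mape), prediction * (1 + mape)


# Hypothetical 6-week predictions and per-store MAPE values (illustrative only).
stores = {
    1: {"prediction": 160_000.0, "mape": 0.10},
    2: {"prediction": 240_000.0, "mape": 0.05},
}

total_prediction = sum(s["prediction"] for s in stores.values())
total_worst = sum(sales_scenarios(s["prediction"], s["mape"])[0] for s in stores.values())
total_best = sum(sales_scenarios(s["prediction"], s["mape"])[1] for s in stores.values())

print(f"prediction: {total_prediction:,.2f}")
print(f"worst case: {total_worst:,.2f}")
print(f"best case:  {total_best:,.2f}")
```

Because the error is applied per store before summing, stores with a higher MAPE contribute a wider band to the chain-level scenarios.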
- The data available is only from 2013-01-01 to 2015-07-31.
- Stores without information on distance from competitors are considered without competition nearby.
- Seasons of the year:
  - Spring starts on March 1st
  - Summer starts on June 1st
  - Fall starts on September 1st
  - Winter starts on December 1st
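The assumptions above could be encoded roughly as follows. This is a sketch, not the project's code: the helper names are hypothetical, and the placeholder distance used to mean "no competitor nearby" is an assumed value chosen only to be far larger than any realistic distance.

```python
from datetime import date

# Hypothetical placeholder (in meters) standing for "no competitor nearby".
# The value actually used in the project is not stated; this one is assumed.
NO_COMPETITION_DISTANCE = 200_000.0


def fill_competition_distance(distance):
    """Replace a missing distance with the 'no competition' placeholder."""
    return NO_COMPETITION_DISTANCE if distance is None else distance


def season_of(day):
    """Map a date to its season using the start dates listed above."""
    if 3 <= day.month <= 5:
        return "spring"
    if 6 <= day.month <= 8:
        return "summer"
    if 9 <= day.month <= 11:
        return "fall"
    return "winter"  # December, January, February


print(season_of(date(2015, 7, 31)))
print(fill_competition_distance(None))
```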
The variables in the original dataset are as follows:
Variable | Definition |
---|---|
store | unique ID for each store |
days_of_week | day of the week, numbered from 1 (Monday) |
date | date that the sales occurred |
sales | amount of products or services sold in one day |
customers | number of customers |
open | whether the store was open (1) or closed (0) |
promo | whether the store was participating in a promotion (1) or not (0) |
state_holiday | whether it was a state holiday (a=public holiday, b=easter holiday, c=christmas) or not (0) |
store_type | designates the store model as a, b, c or d. |
assortment | indicates the store assortment as: a=basic, b=extra, c=extended |
competition_distance | distance in meters to the nearest competitor store |
competition_open_since_month | approximate month in which the nearest competitor opened |
competition_open_since_year | approximate year in which the nearest competitor opened |
promo2 | whether the store was participating in a continuing, consecutive promotion (1) or not (0) |
promo2_since_week | calendar week in which the store started participating in promo2 |
promo2_since_year | year in which the store started participating in promo2 |
promo2_interval | months in which a new round of promo2 starts |
The variables created during project development are as follows:
Variable | Definition |
---|---|
year | year of the date on which the sales occurred |
month | month of the date on which the sales occurred |
day | day of the date on which the sales occurred |
week_of_year | ISO week of the year of the sale date; week 1 is the week containing the first Thursday of the year, and numbering starts at 1 (int type) |
year_week | year and week of the sale date; weeks start on Monday and numbering starts at 0, so days before the first Monday fall in week 0 (object type, %Y-%W) |
season | season of the date on which the sales occurred |
competition_open_since | concatenation of 'competition_open_since_year' and 'competition_open_since_month' |
competition_open_timeinmonths | time in months that the competitor had been open as of the sale date |
promo2_since | concatenation of 'promo2_since_year' and 'promo2_since_week' |
promo2_since_timeinweeks | time in weeks since promo2 began as of the sale date |
month_map | month of the date on which the sales occurred, used as an auxiliary feature |
is_promo2 | whether the purchase occurred during an active promo2 (1) or not (0) |
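As an illustration of how some of these derived variables could be computed, the sketch below uses only the Python standard library. The function names are hypothetical, and approximating a month as 30 days is an assumption; the project may compute the elapsed time differently.

```python
from datetime import date


def week_features(sale_date):
    """Return (week_of_year, year_week) for a sale date."""
    week_of_year = sale_date.isocalendar()[1]  # ISO week, numbered from 1
    year_week = sale_date.strftime("%Y-%W")    # Monday-based week, numbered from 0
    return week_of_year, year_week


def competition_open_timeinmonths(sale_date, since_year, since_month):
    """Months the competitor has been open, approximating a month as 30 days."""
    competition_open_since = date(since_year, since_month, 1)
    return (sale_date - competition_open_since).days // 30


print(week_features(date(2015, 7, 31)))   # (31, '2015-30')
print(week_features(date(2013, 1, 1)))    # (1, '2013-00')
print(competition_open_timeinmonths(date(2015, 3, 2), 2015, 1))  # 2
```

The two week conventions can disagree for the same date, as the first call shows, which is why `week_of_year` and `year_week` are kept as separate variables.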
- Data Description
- Feature Engineering
- Data Filtering
- Exploratory Data Analysis
- Data Preparation
- Feature Selection
- Machine Learning Modeling
- Hyperparameter Fine-Tuning
- Model-to-Business Interpretation
- Model Deployment
1. Distance from competitors does not seem to correlate with store sales.
2. Stores sold more in the second semester of 2013, but not in 2014.
3. Sales during the spring correspond to 41.41% of the total.
Machine learning models used:
- Linear Regression
- Regularized Linear Regression
- Random Forest Regressor
- XGBoost Regressor
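This README does not describe how the cross-validation folds were built. Since the data is a time series and the goal is a 6-week forecast, one common choice is to validate each model on consecutive 6-week windows at the end of the data, training only on earlier dates. The sketch below assumes that scheme; the function name and the number of folds are hypothetical.

```python
from datetime import date, timedelta


def time_series_folds(last_date, n_folds=5, horizon_weeks=6):
    """Return (validation_start, validation_end) pairs, oldest fold first.

    For each fold, a model would be trained on rows dated before
    validation_start and evaluated on rows inside the window.
    """
    folds = []
    for k in range(n_folds, 0, -1):
        validation_start = last_date - timedelta(weeks=k * horizon_weeks) + timedelta(days=1)
        validation_end = validation_start + timedelta(weeks=horizon_weeks) - timedelta(days=1)
        folds.append((validation_start, validation_end))
    return folds


# The last date available in the dataset is 2015-07-31.
for start, end in time_series_folds(date(2015, 7, 31), n_folds=2):
    print(start, "to", end)
```

Splitting by date rather than at random keeps future sales out of the training data, which gives error estimates closer to what a real forecast would see.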
Results after cross-validation, where:
MAE = mean absolute error;
MAPE = mean absolute percentage error;
RMSE = root mean squared error.
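For reference, the three metrics can be written in a few lines of plain Python. This is a minimal sketch using the standard definitions, not the project's code; the function names and the sample values are hypothetical, and MAPE as written assumes no actual value is zero.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (0.1 means 10%)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical daily sales and predictions (illustrative only).
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mae(actual, predicted), mape(actual, predicted), rmse(actual, predicted))
```

MAE and RMSE are in the same unit as sales, while MAPE is relative, which makes it easier to compare stores of different sizes.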
Final XGBoost result after fine-tuning:
Access the Telegram bot here.
The objective of this project was to develop a sales prediction model for Rossmann stores. The Telegram bot, as the deliverable data product, satisfies the CFO's demands.
- Address missing values in a better way.
- Test other machine learning models.
- Improve the messages returned by the Telegram bot.
References:
- Blog Seja um Data Scientist
- Dataset Rossmann Store Sales from Kaggle
- Variables meaning on Kaggle