Analysis and predictions of the Titanic data set from Kaggle
- Engineer the available features (e.g. age, social class, ticket fare) into a form suitable for a machine learning algorithm
- Train various machine learning models to make predictions of survival
- Compare the outcomes of the models using metrics such as classification reports and confusion matrices
- Trial/Error: What happens to the predictions when particular features are excluded from training (do they improve or deteriorate)?
- Statistical theory: What is the intended use of the model, and what attributes of the data make it an appropriate fit?
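
As a rough illustration of the feature engineering and model comparison steps above, here is a minimal standard-library sketch. The column names (`Sex`, `Pclass`, `Age`, `Fare`) are assumed to match the Kaggle CSV, the rows and labels are invented toy values rather than real passengers, and the helper names (`encode`, `confusion_matrix`, `precision_recall`) are hypothetical. The median imputation for missing ages is one possible choice, not necessarily the one used in this project.

```python
from statistics import median

# Toy rows invented for illustration; NOT taken from the Kaggle data.
# Column names are assumed to match the Kaggle CSV.
rows = [
    {"Sex": "male", "Pclass": 3, "Age": 22.0, "Fare": 7.25},
    {"Sex": "female", "Pclass": 1, "Age": 38.0, "Fare": 71.28},
    {"Sex": "female", "Pclass": 3, "Age": None, "Fare": 7.92},
    {"Sex": "male", "Pclass": 2, "Age": 35.0, "Fare": 13.0},
]


def encode(rows):
    """Turn raw passenger records into numeric feature vectors."""
    known_ages = [r["Age"] for r in rows if r["Age"] is not None]
    age_fill = median(known_ages)  # assumption: impute missing ages with the median
    encoded = []
    for r in rows:
        age = r["Age"] if r["Age"] is not None else age_fill
        encoded.append([
            1.0 if r["Sex"] == "female" else 0.0,  # binary feature
            1.0 if r["Pclass"] == 1 else 0.0,      # one-hot columns for class
            1.0 if r["Pclass"] == 2 else 0.0,
            1.0 if r["Pclass"] == 3 else 0.0,
            age,
            r["Fare"],
        ])
    return encoded


def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 means survived."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


def precision_recall(y_true, y_pred):
    """Precision and recall for the 'survived' class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


X = encode(rows)

# Hypothetical true labels and model predictions, used only to exercise the metrics.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(confusion_matrix(y_true, y_pred))
print(precision_recall(y_true, y_pred))
```

In practice, scikit-learn's `confusion_matrix` and `classification_report` (in `sklearn.metrics`) cover the same ground. The feature-exclusion experiment amounts to dropping a column from the encoded vectors, retraining, and comparing these metrics again.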
Primary goal: Find clues to meaningful relationships amongst the data: Identify critical predictors.
- Which features have a stark impact on survival?
- Display these relationships in a simple, obvious manner for a non-technical audience (i.e. visualization).
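
One simple way to look for such a relationship is to compare survival rates across the values of a single feature. The sketch below assumes records with a `Survived` column coded 0/1 (as in the Kaggle CSV). The rows are invented toy values, not real passengers, and `survival_rate_by` and `text_bars` are hypothetical helper names. The text bars stand in for a proper chart.

```python
from collections import defaultdict

# Toy rows invented for illustration; NOT taken from the Kaggle data.
toy_rows = [
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 0},
    {"Sex": "male", "Survived": 1},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 0},
]


def survival_rate_by(rows, feature, target="Survived"):
    """Fraction of survivors for each distinct value of `feature`."""
    totals = defaultdict(int)
    survived = defaultdict(int)
    for r in rows:
        totals[r[feature]] += 1
        survived[r[feature]] += r[target]
    return {value: survived[value] / totals[value] for value in totals}


def text_bars(rates, width=20):
    """Render the rates as plain-text bars, one line per feature value."""
    lines = []
    for value in sorted(rates):
        bar = "#" * round(rates[value] * width)
        lines.append(f"{value:<8}{bar} {rates[value]:.0%}")
    return lines


rates = survival_rate_by(toy_rows, "Sex")
for line in text_bars(rates):
    print(line)
```

A large gap between the bars suggests the feature is worth testing as a predictor. For a non-technical audience, the same rates could be drawn as a bar chart with a plotting library.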
Secondary goal: Demonstrate interesting relationships amongst the data, even if they do not correlate to critical predictors.
- The definition of "interesting" will depend on the intended audience (e.g. a cruise line selling tickets for, or a shipyard building, a Titanic Mark II; an agency conducting safety investigations or studying social bias in a state of emergency).
- Some of the predictors are binary or class-based.
- In this case, I am building a classifier, which means some statistical models are less appropriate.