Basic Machine Learning / Feature Engineering question

Dear Kagglers,

In another thread, where I can no longer reply (https://inclass.kaggle.com/c/deloitte-tackles-titanic/forums/t/9841/getting-high-scores-without-looking-at-actual-data-set), a fellow Kaggler gave this advice on getting higher scores:

"SibSp and Parch variables are also good indicators of survival. You need to notice certain things, like if Husband survived probability of his wife surviving is high. If the child survived, probability of one of his parents surviving is also high. Use it wisely instead of just adding them and creating Family variable."

I understand that this can be determined for the training set: I could create a binary/dummy variable that uses the family name to flag whether any family member actually survived, since we know who survived and who perished. But what happens if I do this on the test set or any unknown dataset…
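To make the dummy variable described above concrete, here is a minimal sketch in plain Python. It is only one possible reading of the idea, not the method from the linked thread. The assumptions are mine: families are matched by the surname before the comma in `Name`, the feature for every row is built from training labels only, a passenger's own label is excluded when building their training feature, and `-1` marks "no information". The function names and the toy rows are hypothetical.

```python
from collections import defaultdict


def surname(name):
    # Titanic-style names look like "Surname, Title. Given names";
    # assumption: the part before the first comma identifies the family.
    return name.split(",")[0].strip()


def family_survival_feature(train_rows, test_rows):
    """Return (train_feature, test_feature) as lists of ints.

    1  = at least one other known family member survived
    0  = other family members are known and none survived
    -1 = no labelled family member is available
    """
    # Group the known outcomes by surname, using training labels only.
    by_family = defaultdict(list)
    for i, row in enumerate(train_rows):
        by_family[surname(row["Name"])].append((i, row["Survived"]))

    # Training rows: leave the passenger's own label out, otherwise the
    # feature would simply leak the target.
    train_feature = []
    for i, row in enumerate(train_rows):
        others = [s for j, s in by_family[surname(row["Name"])] if j != i]
        train_feature.append(int(any(others)) if others else -1)

    # Test rows: their own labels are unknown, so look up relatives
    # who appear in the training set.
    test_feature = []
    for row in test_rows:
        known = [s for _, s in by_family.get(surname(row["Name"]), [])]
        test_feature.append(int(any(known)) if known else -1)

    return train_feature, test_feature


# Toy rows, invented purely for illustration.
train = [
    {"Name": "Smith, Mr. John", "Survived": 0},
    {"Name": "Smith, Mrs. Mary", "Survived": 1},
    {"Name": "Jones, Miss. Ann", "Survived": 1},
]
test = [
    {"Name": "Smith, Master. Tom"},
    {"Name": "Brown, Mr. Carl"},
]

train_feature, test_feature = family_survival_feature(train, test)
print(train_feature)  # [1, 0, -1]
print(test_feature)   # [1, -1]
```

With this construction a test passenger only gets a 0 or 1 when a relative happens to be in the training set, and everyone else falls back to `-1`. Matching on surname alone will also group unrelated passengers who share a name, so combining it with another column such as the ticket number may be worth trying.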

