This problem involves the Hitters data set, part of the ISLR package, on salaries of baseball hitters. Dispose of “NA” observations and perform a log-transformation on the Salary variable. Then create a training set containing a random sample of 200 observations.
library(ISLR)
Hitters$LogSalary <- log(Hitters$Salary)
Hitters <- na.omit(Hitters)
set.seed(1)
train <- sample(nrow(Hitters), 200)
Fit a tree to the training data, with LogSalary as the response and the other variables as predictors. Is it a classification or regression tree? Use the summary() function to produce summary statistics about the tree. What is the training MSE? How many terminal nodes does the tree have? Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes (e.g., the very last one) and interpret its information. Create a labeled plot of the tree. List all the predictors showing up in the tree. Predict the response on the test data and calculate the MSE of those predictions. Apply the cv.tree() function to the training set and produce a plot with tree size on the x-axis and cross-validated MSE on the y-axis. Which tree size corresponds to the lowest cross-validated MSE? Fit a pruned tree corresponding to the optimal tree size obtained from cross-validation and apply the summary() function to it. If cross-validation does not lead to the selection of a pruned tree, then create a pruned tree with five terminal nodes. Compare the training MSE between the pruned and unpruned trees. Which is higher? Compare the test MSE between the pruned and unpruned trees. Which is higher?
The Correct Answer and Explanation:
To solve this problem using the Hitters dataset from the ISLR package, follow the step-by-step instructions below:
1. Load data and remove NAs
library(ISLR)
library(tree)
data(Hitters)
Hitters <- na.omit(Hitters)
Hitters$LogSalary <- log(Hitters$Salary)
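As a quick sanity check: the Hitters data ships with 322 players, 59 of whom have a missing Salary, so 263 complete cases should remain, with 21 columns after adding LogSalary.
dim(Hitters)  # 263 21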
2. Create training and test sets
set.seed(1)
train <- sample(nrow(Hitters), 200)
train_data <- Hitters[train, ]
test_data <- Hitters[-train, ]
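This puts 200 players in the training set and leaves the remaining 63 as the test set:
nrow(train_data)  # 200
nrow(test_data)   # 63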
3. Fit a tree model
tree_model <- tree(LogSalary ~ . - Salary, data = train_data)
summary(tree_model)
This is a regression tree, since the response variable LogSalary is continuous.
From the summary, you will get:
- The residual mean deviance. For a regression tree this is the residual sum of squares divided by n minus the number of terminal nodes; the training MSE divides the same residual sum of squares by n, so the two are close but not identical.
- The number of terminal nodes.
- The variables actually used in the tree construction.
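If you prefer to pull these numbers out programmatically rather than reading them off the printout, here is a minimal sketch; it assumes the summary object exposes $size, $df, and $dev components, as the tree package's summary.tree objects do:
tree_sum <- summary(tree_model)
tree_sum$size                    # number of terminal nodes
tree_sum$dev / nrow(train_data)  # training MSE: RSS / n
tree_sum$dev / tree_sum$df       # residual mean deviance: RSS / (n - size)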
4. Text output and plot
tree_model # Detailed structure of the tree
plot(tree_model)
text(tree_model, pretty = 0)
Each line of the text output has the form: node number, split rule, number of observations, deviance, and predicted value (the mean LogSalary of the observations in that node), with terminal nodes marked by an asterisk. Choose a terminal node (e.g., the very last one) and interpret it: its rule might read like “Years < 3.5”, and its predicted value is the mean LogSalary for players falling into that node.
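For illustration, a hypothetical terminal-node line could look like this (the numbers are made up, not actual output):
10) Years < 3.5 45 10.230 5.106 *
This reads: node 10 contains 45 training players with fewer than 3.5 years of experience, has within-node deviance 10.230, and predicts a LogSalary of 5.106.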
5. Predict and compute test MSE
predictions <- predict(tree_model, newdata = test_data)
test_mse <- mean((test_data$LogSalary - predictions)^2)
6. Cross-validation and pruning
set.seed(2)
cv_result <- cv.tree(tree_model)
plot(cv_result$size, cv_result$dev, type = "b", xlab = "Tree Size", ylab = "CV MSE")
Note that cv_result$dev holds the cross-validated deviance (total squared error), which ranks tree sizes the same way the CV MSE does. The optimal size is the one with the lowest deviance; be careful that which.min(cv_result$dev) returns an index into the dev vector, not the tree size itself.
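A small sketch to extract the optimal size:
best_size <- cv_result$size[which.min(cv_result$dev)]
best_size  # e.g., 5; use this value for the "best" argument when pruning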
7. Prune and evaluate
pruned_tree <- prune.tree(tree_model, best = 5)
summary(pruned_tree)
# Training MSE
train_pred_pruned <- predict(pruned_tree, newdata = train_data)
train_mse_pruned <- mean((train_data$LogSalary - train_pred_pruned)^2)
# Test MSE
test_pred_pruned <- predict(pruned_tree, newdata = test_data)
test_mse_pruned <- mean((test_data$LogSalary - test_pred_pruned)^2)
8. Comparison
Compare training and test MSEs of the original and pruned trees. The unpruned tree usually fits the training data better (lower training MSE), while the pruned tree may generalize better to the test data (lower test MSE). This indicates a classic trade-off between overfitting and generalization.
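To make the comparison concrete, compute the unpruned tree's training MSE the same way (its test MSE, test_mse, was already computed in step 5) and place the numbers side by side:
train_pred_unpruned <- predict(tree_model, newdata = train_data)
train_mse_unpruned <- mean((train_data$LogSalary - train_pred_unpruned)^2)
c(unpruned = train_mse_unpruned, pruned = train_mse_pruned)  # training MSEs
c(unpruned = test_mse, pruned = test_mse_pruned)             # test MSEs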
9. Predictors used in the tree
You can see which predictors appear by examining tree_model or summary(tree_model). Only the predictors involved in splits will show up.
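Two equivalent ways to list them, assuming the tree package's summary object exposes the split variables via $used (the tree's frame stores the split variable for each node, with "<leaf>" marking terminal nodes):
summary(tree_model)$used  # variables actually used in tree construction
setdiff(unique(as.character(tree_model$frame$var)), "<leaf>")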
Explanation: This analysis uses a regression tree to model LogSalary, a transformation that stabilizes variance and makes the distribution more normal. The training set is randomly selected, and a regression tree is fit using the tree package. The summary reveals performance and structure. By pruning the tree through cross-validation, we identify the optimal complexity. Comparing training and test MSEs helps evaluate model generalization. The entire process reflects key steps in supervised machine learning.
