{"id":237504,"date":"2025-06-17T09:41:38","date_gmt":"2025-06-17T09:41:38","guid":{"rendered":"https:\/\/learnexams.com\/blog\/?p=237504"},"modified":"2025-06-17T09:41:40","modified_gmt":"2025-06-17T09:41:40","slug":"this-problem-involves-the-hitters-data-set-a-part-of-the-islr-package-on-salaries-of-baseball-hitters","status":"publish","type":"post","link":"https:\/\/www.learnexams.com\/blog\/2025\/06\/17\/this-problem-involves-the-hitters-data-set-a-part-of-the-islr-package-on-salaries-of-baseball-hitters\/","title":{"rendered":"This problem involves the Hitters data set, a part of the ISLR package, on salaries of baseball hitters."},"content":{"rendered":"\n<p>This problem involves the Hitters data set, a part of the ISLR package, on salaries of baseball hitters. Dispose of &#8220;NA&#8221; observations and perform a log-transformation on the Salary variable. Then create a training set containing a random sample of 200 observations. library(ISLR) Hitters$LogSalary &lt;- log(Hitters$Salary) Hitters &lt;- na.omit(Hitters) set.seed(1) train &lt;- sample(nrow(Hitters), 200) Fit a tree to the training data, with LogSalary as the response and the other variables as predictors. Is it a classification or regression tree? Use the summary() function to produce summary statistics about the tree. What is the training MSE? How many terminal nodes does the tree have? Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes (e.g., the very last one) and interpret its information. Create a labeled plot of the tree. List all the predictors showing up in the tree. Predict the response on the test data and calculate the MSE of those predictions. Apply the cv.tree() function to the training set and produce a plot with tree size on the x-axis and cross-validated MSE on the y-axis. Which tree size corresponds to the lowest cross-validated MSE? Fit a pruned tree corresponding to the optimal tree size obtained in (9) and apply the summary() function to it. 
If cross-validation does not lead to the selection of a pruned tree, then create a pruned tree with five terminal nodes. Compare the training MSE between the pruned and unpruned trees. Which is higher? Compare the test MSE between the pruned and unpruned trees. Which is higher?<\/p>\n\n\n\n<p><strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-ast-global-color-1-color\">The Correct Answer and Explanation is:<\/mark><\/strong><\/p>\n\n\n\n<p>To solve this problem using the <strong>Hitters<\/strong> dataset from the <strong>ISLR<\/strong> package, follow the step-by-step instructions below:<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">1. <strong>Load data and remove NAs<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>library(ISLR)\nlibrary(tree)\n\ndata(Hitters)\nHitters &lt;- na.omit(Hitters)\nHitters$LogSalary &lt;- log(Hitters$Salary)\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">2. <strong>Create training and test sets<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>set.seed(1)\ntrain &lt;- sample(nrow(Hitters), 200)\ntrain_data &lt;- Hitters[train, ]\ntest_data &lt;- Hitters[-train, ]\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">3. <strong>Fit a tree model<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>tree_model &lt;- tree(LogSalary ~ . 
- Salary, data = train_data)\nsummary(tree_model)\n<\/code><\/pre>\n\n\n\n<p>This is a <strong>regression tree<\/strong>, since the response variable <code>LogSalary<\/code> is continuous.<\/p>\n\n\n\n<p>From the summary, you will get:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Residual mean deviance<\/strong> (the training sum of squared errors divided by the number of observations minus the number of terminal nodes, so it is close to, but slightly larger than, the training MSE)<\/li>\n\n\n\n<li><strong>Number of terminal nodes<\/strong><\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">4. <strong>Text output and plot<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>tree_model  # Detailed structure of the tree\nplot(tree_model)\ntext(tree_model, pretty = 0)\n<\/code><\/pre>\n\n\n\n<p>Choose a terminal node (e.g., node 10) and interpret it. Terminal nodes are marked with an asterisk in the printed output. Each one shows the split that defines it, such as &#8220;Years &lt; 3.5&#8221;, followed by the number of training observations in the node, the deviance, and the mean LogSalary (<code>yval<\/code>), which is the prediction for players falling into that node.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">5. <strong>Predict and compute test MSE<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>predictions &lt;- predict(tree_model, newdata = test_data)\ntest_mse &lt;- mean((test_data$LogSalary - predictions)^2)\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">6. <strong>Cross-validation and pruning<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>set.seed(2)\ncv_result &lt;- cv.tree(tree_model)\n# cv_result$dev is the total cross-validated deviance; divide by n to get an MSE\nplot(cv_result$size, cv_result$dev \/ nrow(train_data), type = \"b\", xlab = \"Tree Size\", ylab = \"CV MSE\")\n<\/code><\/pre>\n\n\n\n<p>From the plot or <code>cv_result$size[which.min(cv_result$dev)]<\/code>, find the optimal size (e.g., 5).<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">7. 
<strong>Prune and evaluate<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-preformatted\"><code># best = 5 is the fallback size from the problem; use the CV-selected size if it differs\npruned_tree &lt;- prune.tree(tree_model, best = 5)\nsummary(pruned_tree)\n\n# Training MSE of the unpruned tree, for comparison\ntrain_mse &lt;- mean((train_data$LogSalary - predict(tree_model, newdata = train_data))^2)\n\n# Training MSE\ntrain_pred_pruned &lt;- predict(pruned_tree, newdata = train_data)\ntrain_mse_pruned &lt;- mean((train_data$LogSalary - train_pred_pruned)^2)\n\n# Test MSE\ntest_pred_pruned &lt;- predict(pruned_tree, newdata = test_data)\ntest_mse_pruned &lt;- mean((test_data$LogSalary - test_pred_pruned)^2)\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">8. <strong>Comparison<\/strong><\/h3>\n\n\n\n<p>Compare training and test MSEs of the original and pruned trees. Because the pruned tree is a subtree of the unpruned one, the unpruned tree always fits the training data at least as well (its training MSE is never higher), while the pruned tree may generalize better to the test data (lower test MSE). Which tree has the higher test MSE depends on the data, so check the computed values. This reflects the classic trade-off between overfitting and generalization.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">9. <strong>Predictors used in the tree<\/strong><\/h3>\n\n\n\n<p>You can see which predictors appear by examining <code>tree_model<\/code> or <code>summary(tree_model)<\/code>, which lists them under &#8220;Variables actually used in tree construction&#8221;. Only the predictors involved in splits will show up.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong>Explanation<\/strong>: This analysis uses a regression tree to model <code>LogSalary<\/code>, the log of <code>Salary<\/code>, a transformation that reduces the right skew of the salary distribution. The training set is randomly selected, and a regression tree is fit using the <code>tree<\/code> package. The summary reports the residual mean deviance, the number of terminal nodes, and the variables actually used in splits. Cross-validation identifies the tree size with the lowest estimated error, and pruning reduces the tree to that size. Comparing training and test MSEs helps evaluate model generalization. 
The entire process reflects key steps in supervised machine learning.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/learnexams.com\/blog\/wp-content\/uploads\/2025\/06\/learnexams-banner8-855.jpeg\" alt=\"\" class=\"wp-image-237505\"\/><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>This problem involves the Hitters data set, a part of the ISLR package, on salaries of baseball hitters. Dispose of &#8220;NA&#8221; observations and perform a log-transformation on the Salary variable. Then create a training set containing a random sample of 200 observations. library(ISLR) HittersSalary) Hitters &lt;- na.omit(Hitters) set.seed(1) train &lt;- sample(nrow(Hitters), 200) Fit a tree [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[25],"tags":[],"class_list":["post-237504","post","type-post","status-publish","format-standard","hentry","category-exams-certification"],"_links":{"self":[{"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/posts\/237504","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/comments?post=237504"}],"version-history":[{"count":0,"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/posts\/237504\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/media?parent=237504"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/categories?post=237504"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/tags?post=237504"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}