Wikis for Learning and Collaboration

A WIKI is a collaborative website that collects and organizes content created and revised by its users; the best-known example is Wikipedia. Wikis are a way to grow a knowledge base around a particular content area, whether best practices in a field or how to use a specific piece of hardware or software. A hallmark of Web 2.0 is that a service improves as more people use it, and this principle underlies wiki-based learning: within any enterprise, a great deal of knowledge exists among its members, and sharing that knowledge and information can raise the organization's intelligence level, whether the organization is a university, an association, a corporation, or a club.

Instructions: Each student will create a new wiki entry from a peer-reviewed research paper that pertains to the subject below, or provide a summary or substantive commentary on an existing wiki entry from a classmate.

Admin Notes: Conduct your own research and post a short, relevant summary of your findings. (Post current information, no older than five years.) Use no more than three (3) references. Remember to place your name in your paper.

Topic: Introduction to Data Mining, Chapter 4: Classification – Basic Concepts.

Task: Select 2 of the following models for evaluation and provide samples:
1. Holdout
2. Random subsampling
3. Cross-validation
4. Stratified sampling
5. Bootstrap
Answer and Explanation:
Wiki Entry by Eva Nyamu
Topic: Introduction to Data Mining – Chapter 4: Classification – Basic Concepts
Task: Model Evaluation Techniques: Holdout and Cross-Validation
1. Holdout Method
The holdout method is one of the simplest ways to evaluate a classification model. The dataset is split into two distinct sets: the training set and the test set. Commonly, 70% of the data is used for training and 30% for testing, though this ratio may vary.
Example:
Suppose we have a dataset of 1,000 patient records used to predict the likelihood of developing diabetes. We assign 700 records to train a decision tree model. The remaining 300 records are used to test the model’s accuracy. If the model correctly predicts outcomes for 270 out of 300 test cases, the accuracy is 90%.
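The 70/30 split in this example can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed implementation: the `holdout_split` helper and the use of integer IDs 0–999 as stand-ins for the 1,000 patient records are assumptions for demonstration purposes.

```python
import random

def holdout_split(records, train_frac=0.7, seed=42):
    """Shuffle the records, then split them into training and test sets.

    Shuffling before the cut avoids bias from any ordering in the data
    (e.g., records sorted by admission date).
    """
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = records[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# 1,000 hypothetical patient record IDs standing in for the real dataset
records = list(range(1000))
train_set, test_set = holdout_split(records)
print(len(train_set), len(test_set))   # 700 300
```

A decision tree would then be fit on the 700 training records and scored on the 300 held-out records; with 270 correct predictions, accuracy is 270 / 300 = 0.90. Changing the seed changes which records land in the test set, which is exactly the split-dependence weakness noted below.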
Advantages:
- Fast and easy to implement
- Useful for large datasets
Disadvantages:
- Results can vary depending on how the data is split
- May not be reliable for small datasets
2. Cross-Validation (k-Fold)
Cross-validation improves the reliability of model evaluation by reducing the impact of data split bias. In k-fold cross-validation, the data is divided into k equal parts (folds). The model is trained k times, each time using k-1 folds for training and the remaining fold for testing. The final performance is averaged over all k trials.
Example:
Using the same dataset of 1,000 records, we perform 5-fold cross-validation. This means we split the data into 5 parts of 200 records each. The model is trained on 800 records and tested on 200 records in five different combinations. We average the accuracy results from each fold to get a more stable evaluation.
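The 5-fold rotation described above can be sketched the same way. This is an illustrative pure-Python version (the `k_fold_splits` helper is an assumption, not a named method from the chapter); in practice a library routine would typically be used.

```python
import random

def k_fold_splits(n, k=5, seed=42):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    The n record indices are shuffled once, then each of the k folds takes
    a turn as the test set while the remaining k-1 folds form the training set.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    fold = n // k                      # assumes n is divisible by k for simplicity
    for i in range(k):
        test_idx = idx[i * fold:(i + 1) * fold]
        train_idx = idx[:i * fold] + idx[(i + 1) * fold:]
        yield train_idx, test_idx

# For 1,000 records and k=5: each iteration trains on 800 and tests on 200.
for train_idx, test_idx in k_fold_splits(1000, k=5):
    print(len(train_idx), len(test_idx))   # 800 200 (five times)
```

Each record appears in exactly one test fold, so every record is used for both training and testing; the per-fold accuracies are then averaged for the final estimate.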
Advantages:
- Reduces variance in performance estimation
- Makes full use of the dataset for both training and testing
Disadvantages:
- Requires more computational time
References:
- Han, J., Pei, J., & Kamber, M. (2021). Data Mining: Concepts and Techniques (4th ed.). Morgan Kaufmann.
- Kotsiantis, S. B. (2007). Supervised Machine Learning: A Review of Classification Techniques. Informatica, 31(3), 249–268.
- Brownlee, J. (2020). Machine Learning Mastery With Python. Machine Learning Mastery.
