Let's talk about the datasets that you need to build an ML problem on HackerEarth
What kinds of datasets do I need while building a Machine Learning problem on HackerEarth?
Your dataset must be divided into two parts:
- Training data set
- Test data set
What is a training data set?
A training data set is the data that candidates will use to train their models.
What is a test data set?
A test data set is the unseen data for which candidates will predict an outcome. The test data must not include the outcome (the target variable).
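As a minimal sketch of how a problem setter might confirm that a test file does not expose the outcome, the snippet below reads only the header row of a CSV and reports any target columns found there. The function name `find_leaked_columns`, the sample file contents, and the choice of `Play` as the target column are assumptions made for illustration; they are not part of HackerEarth's tooling.

```python
import csv
import io


def find_leaked_columns(test_csv_text, target_columns):
    """Return the target columns that wrongly appear in the test file header."""
    reader = csv.reader(io.StringIO(test_csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in target_columns if name in present]


# Hypothetical test file contents; "Play" is assumed to be the target column.
sample_test = "ID,Outlook,Temperature,Humidity,Wind\n1,Sunny,Hot,High,False\n"

leaked = find_leaked_columns(sample_test, ["Play"])
print("Leaked target columns:", leaked)
```

An empty list means the header contains none of the listed target columns.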
How do I divide my dataset?
For example, the following data set of 10 rows can be divided into two parts:
- Training data set (50% of the rows)
- Test data set (remaining 50% of the rows)
Entire data set

| Outlook | Temperature | Humidity | Wind | Play |
| --- | --- | --- | --- | --- |
| Sunny | Hot | High | False | No |
| Rainy | Mild | High | False | Yes |
| Sunny | Cool | Normal | False | Yes |
| Overcast | Hot | High | False | Yes |
| Rainy | Mild | High | False | Yes |
| Overcast | Hot | Normal | False | Yes |
| Sunny | Mild | Normal | True | Yes |
| Sunny | Mild | High | False | No |
| Overcast | Cool | Normal | True | Yes |
| Rainy | Mild | High | True | Yes |
Training data set

| ID | Outlook | Temperature | Humidity | Wind | Play |
| --- | --- | --- | --- | --- | --- |
| 1 | Sunny | Hot | High | False | No |
| 2 | Rainy | Mild | High | False | Yes |
| 3 | Sunny | Cool | Normal | False | Yes |
| 4 | Overcast | Hot | High | False | Yes |
| 5 | Rainy | Mild | High | False | Yes |
Test data set (test.csv)

| ID | Outlook | Temperature | Humidity | Wind |
| --- | --- | --- | --- | --- |
| 1 | Overcast | Hot | Normal | False |
| 2 | Sunny | Mild | Normal | True |
| 3 | Sunny | Mild | High | False |
| 4 | Overcast | Cool | Normal | True |
| 5 | Rainy | Mild | High | True |
Note: The target variable (Play) is deliberately not provided in the test data set.
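The 50/50 split described above can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not a required procedure: the helper names `split_dataset` and `to_csv` are hypothetical, the target column is assumed to be `Play`, the first half of the rows is assumed to go to the training set, and each part gets its own `ID` column starting at 1.

```python
import csv
import io

TARGET = "Play"  # assumed name of the target column

# The 10-row example data set, written as CSV text.
FULL_DATA = """Outlook,Temperature,Humidity,Wind,Play
Sunny,Hot,High,False,No
Rainy,Mild,High,False,Yes
Sunny,Cool,Normal,False,Yes
Overcast,Hot,High,False,Yes
Rainy,Mild,High,False,Yes
Overcast,Hot,Normal,False,Yes
Sunny,Mild,Normal,True,Yes
Sunny,Mild,High,False,No
Overcast,Cool,Normal,True,Yes
Rainy,Mild,High,True,Yes
"""


def split_dataset(rows, target=TARGET, train_fraction=0.5):
    """Split rows (a list of dicts) into training rows and test rows.

    Training rows keep the target column; test rows have it removed.
    Each part gets its own ID column, numbered from 1.
    """
    cut = int(len(rows) * train_fraction)
    train = [{"ID": i, **row} for i, row in enumerate(rows[:cut], start=1)]
    test = [
        {"ID": i, **{k: v for k, v in row.items() if k != target}}
        for i, row in enumerate(rows[cut:], start=1)
    ]
    return train, test


def to_csv(rows):
    """Serialize a non-empty list of dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = list(csv.DictReader(io.StringIO(FULL_DATA)))
train_rows, test_rows = split_dataset(rows)

print(to_csv(train_rows))  # training data: includes the Play column
print(to_csv(test_rows))   # test data: Play column removed
```

The returned text can be written to the files you upload, for example `test.csv` for the test part. With a larger data set, you may want to shuffle the rows before splitting so that both parts are representative.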
What does the candidate do after the models are trained?
After the models have been trained, candidates are expected to do the following:
- Predict the outcome for the test data set
- Upload the prediction file to HackerEarth
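To make these two steps concrete, here is a hedged sketch of how a candidate might produce a prediction file. A majority-class baseline stands in for a real trained model, and the layout of the prediction file (an `ID` column plus a `Play` column) is an assumption; the actual format is whatever the problem statement specifies. The helper names `majority_class` and `build_prediction_csv` are hypothetical.

```python
import csv
import io
from collections import Counter

# Training data from the example above, written as CSV text.
TRAIN_CSV = """ID,Outlook,Temperature,Humidity,Wind,Play
1,Sunny,Hot,High,False,No
2,Rainy,Mild,High,False,Yes
3,Sunny,Cool,Normal,False,Yes
4,Overcast,Hot,High,False,Yes
5,Rainy,Mild,High,False,Yes
"""

# Only the IDs of the test rows matter for this baseline.
TEST_IDS = ["1", "2", "3", "4", "5"]


def majority_class(train_rows, target):
    """Return the most frequent target value in the training rows."""
    counts = Counter(row[target] for row in train_rows)
    return counts.most_common(1)[0][0]


def build_prediction_csv(test_ids, prediction, id_column="ID", target="Play"):
    """Build prediction file text: one row per test ID (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([id_column, target])
    for test_id in test_ids:
        writer.writerow([test_id, prediction])
    return buffer.getvalue()


train_rows = list(csv.DictReader(io.StringIO(TRAIN_CSV)))
baseline = majority_class(train_rows, "Play")  # stand-in for a trained model
prediction_csv = build_prediction_csv(TEST_IDS, baseline)
print(prediction_csv)
```

A real submission would replace the baseline with predictions from the candidate's trained model, keeping one prediction per test row.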