HackerEarth’s ML platform supports the Machine Learning flow as shown in the diagram in the Typical learning section.
How do we create an ML problem?
We divide the entire data into two parts:
- Training data set
- Test data set
What is a training data set?
A training data set is the data that candidates will use to train their models.
What is a test data set?
A test data set is unseen data on which the candidates use their trained models to predict an outcome.
Note: The test data that we give to the candidates does not specify the outcome.
What does the candidate do after the models are trained?
After the models have been trained, the candidates are expected to do the following:
- Predict an outcome on the test data set
- Submit the prediction file
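The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the platform's required method: the model is a deliberately trivial baseline (the most common outcome per `Outlook` value), the rows come from the example in this section, and the `ID,Play` layout of the prediction file is an assumed format. The function names `train`, `predict`, and `write_predictions` are hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict

TARGET = "Play"  # target column name, as in the example tables

# Training rows from the example in this section (with the target column).
TRAIN_CSV = """ID,Outlook,Temperature,Humidity,Wind,Play
1,Sunny,Hot,High,False,No
2,Rainy,Mild,High,False,Yes
3,Sunny,Cool,Normal,False,Yes
4,Overcast,Hot,High,False,Yes
5,Rainy,Mild,High,False,Yes
"""

# Remaining rows of the example data set, without the target column.
TEST_CSV = """ID,Outlook,Temperature,Humidity,Wind
1,Overcast,Hot,Normal,False
2,Sunny,Mild,Normal,True
3,Sunny,Mild,High,False
4,Overcast,Cool,Normal,True
5,Rainy,Mild,High,True
"""


def train(rows, feature="Outlook"):
    """Learn the most common target value for each value of one feature."""
    counts = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        counts[row[feature]][row[TARGET]] += 1
        overall[row[TARGET]] += 1
    return {
        "feature": feature,
        "by_value": {v: c.most_common(1)[0][0] for v, c in counts.items()},
        "default": overall.most_common(1)[0][0],  # for unseen feature values
    }


def predict(model, rows):
    """Return one (ID, predicted outcome) pair per test row."""
    feature = model["feature"]
    return [
        (row["ID"], model["by_value"].get(row[feature], model["default"]))
        for row in rows
    ]


def write_predictions(predictions, out):
    """Write the prediction file; the 'ID,Play' layout is an assumption."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ID", TARGET])
    writer.writerows(predictions)


train_rows = list(csv.DictReader(io.StringIO(TRAIN_CSV)))
test_rows = list(csv.DictReader(io.StringIO(TEST_CSV)))

model = train(train_rows)                # 1. train on the training data set
predictions = predict(model, test_rows)  # 2. predict on the test data set

buffer = io.StringIO()                   # 3. build the prediction file
write_predictions(predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

In practice a candidate would replace the baseline with a real model and write the output to a file on disk; the three-step shape (train, predict, write the prediction file) stays the same.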
Example
The following data set of 10 rows can be divided into:
- Training data set (50% of the rows)
- Test data set (remaining 50% of the rows)
Entire data set
| Outlook | Temperature | Humidity | Wind | Play |
| --- | --- | --- | --- | --- |
| Sunny | Hot | High | False | No |
| Rainy | Mild | High | False | Yes |
| Sunny | Cool | Normal | False | Yes |
| Overcast | Hot | High | False | Yes |
| Rainy | Mild | High | False | Yes |
| Overcast | Hot | Normal | False | Yes |
| Sunny | Mild | Normal | True | Yes |
| Sunny | Mild | High | False | No |
| Overcast | Cool | Normal | True | Yes |
| Rainy | Mild | High | True | Yes |
Training data set
| ID | Outlook | Temperature | Humidity | Wind | Play |
| --- | --- | --- | --- | --- | --- |
| 1 | Sunny | Hot | High | False | No |
| 2 | Rainy | Mild | High | False | Yes |
| 3 | Sunny | Cool | Normal | False | Yes |
| 4 | Overcast | Hot | High | False | Yes |
| 5 | Rainy | Mild | High | False | Yes |
Test data set (test.csv)
| ID | Outlook | Temperature | Humidity | Wind |
| --- | --- | --- | --- | --- |
| 1 | Overcast | Hot | Normal | False |
| 2 | Sunny | Mild | Normal | True |
| 3 | Sunny | Mild | High | False |
| 4 | Overcast | Cool | Normal | True |
| 5 | Rainy | Mild | High | True |
Note: We have not provided the target variable in the test data set.
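The 50/50 division described in this example can be sketched as follows. This is a minimal illustration, not the platform's actual tooling: the helper names (`split_rows`, `with_ids`, `to_csv`) are hypothetical, the split simply takes the first half of the rows for training and the rest for testing, and each part gets its own 1-based `ID` column. The target column (`Play`) is dropped when the test file is built.

```python
import csv
import io

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
TARGET = "Play"

# The entire data set from the example (10 rows).
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Rainy", "Mild", "High", "True", "Yes"),
]


def split_rows(rows, train_fraction=0.5):
    """Split rows in order: the first part for training, the rest for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def with_ids(rows, columns):
    """Turn each row into a dict and attach a 1-based ID."""
    return [
        dict(zip(["ID"] + columns, (str(i),) + tuple(row)))
        for i, row in enumerate(rows, start=1)
    ]


def to_csv(records, columns):
    """Serialize records as CSV text, keeping only the listed columns."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


train_part, test_part = split_rows(ROWS)

# The training file keeps the target column.
train_csv = to_csv(with_ids(train_part, COLUMNS), ["ID"] + COLUMNS)

# The test file leaves the target column out.
test_columns = ["ID"] + [c for c in COLUMNS if c != TARGET]
test_csv = to_csv(with_ids(test_part, COLUMNS), test_columns)

print(train_csv)
print(test_csv)
```

A real split would usually shuffle the rows first so that both parts are representative; the ordered split here is kept only so the output is easy to compare with the example rows.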