Building your own ML problem: Datasets

What kinds of datasets do I need to build a machine learning problem on HackerEarth?

Your dataset must be divided into two parts:

Training data set
Test data set

What is a training data set?

A training data set is the data that candidates use to train their models. It includes the outcome (the target variable) for every row.

What is a test data set?

A test data set is unseen data for which candidates predict the outcome. The test data must not include the outcome (the target variable).
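The rule above can be checked before publishing a problem. The following is a minimal sketch, not a HackerEarth tool: the function name `leaked_target_columns` is hypothetical, and the column names are borrowed from the weather example later in this article.

```python
import csv
import io


def leaked_target_columns(test_csv_text, target_columns):
    """Return the target columns that wrongly appear in the test file header."""
    header = next(csv.reader(io.StringIO(test_csv_text)))
    return [col for col in target_columns if col in header]


# Hypothetical test file contents, used only for illustration.
good_test = "ID,Outlook,Temperature,Humidity,Wind\n1,Sunny,Hot,High,False\n"
bad_test = "ID,Outlook,Temperature,Humidity,Wind,Play\n1,Sunny,Hot,High,False,No\n"

good_result = leaked_target_columns(good_test, ["Play"])  # no leaked columns
bad_result = leaked_target_columns(bad_test, ["Play"])    # "Play" is leaked
```

An empty result means the test file does not reveal the outcome.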

How do I divide my dataset?

For example, a data set of 10 rows can be divided as follows:

Training data set (50% of the rows)
Test data set (remaining 50% of the rows)

Entire data set

Outlook	Temperature	Humidity	Wind	Play
Sunny	Hot	High	False	No
Rainy	Mild	High	False	Yes
Sunny	Cool	Normal	False	Yes
Overcast	Hot	High	False	Yes
Rainy	Mild	High	False	Yes
Overcast	Hot	Normal	False	Yes
Sunny	Mild	Normal	True	Yes
Sunny	Mild	High	False	No
Overcast	Cool	Normal	True	Yes
Rainy	Mild	High	True	Yes

Training data set

ID	Outlook	Temperature	Humidity	Wind	Play
1	Sunny	Hot	High	False	No
2	Rainy	Mild	High	False	Yes
3	Sunny	Cool	Normal	False	Yes
4	Overcast	Hot	High	False	Yes
5	Rainy	Mild	High	False	Yes

Test data set (test.csv)

ID	Outlook	Temperature	Humidity	Wind
1	Overcast	Hot	Normal	False
2	Sunny	Mild	Normal	True
3	Sunny	Mild	High	False
4	Overcast	Cool	Normal	True
5	Rainy	Mild	High	True

Note: The target variable (Play) is not provided in the test data set.
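The split described in this section can be sketched in a few lines of Python. This is only an illustration using the standard library: the helper names `split_dataset` and `to_csv` are hypothetical, the ID column is assumed to restart at 1 in each file, and the rows are split in their original order (first half for training, remaining half for testing).

```python
import csv
import io

# The 10-row data set from this article, written as CSV text.
RAW = """Outlook,Temperature,Humidity,Wind,Play
Sunny,Hot,High,False,No
Rainy,Mild,High,False,Yes
Sunny,Cool,Normal,False,Yes
Overcast,Hot,High,False,Yes
Rainy,Mild,High,False,Yes
Overcast,Hot,Normal,False,Yes
Sunny,Mild,Normal,True,Yes
Sunny,Mild,High,False,No
Overcast,Cool,Normal,True,Yes
Rainy,Mild,High,True,Yes
"""

TARGET = "Play"  # target column of the example data set


def split_dataset(rows, train_fraction=0.5):
    """Split rows in order: the first part for training, the rest for testing."""
    cut = int(len(rows) * train_fraction)
    train = [{"ID": i, **row} for i, row in enumerate(rows[:cut], start=1)]
    # Drop the target column so the test file does not reveal the outcome.
    test = [
        {"ID": i, **{k: v for k, v in row.items() if k != TARGET}}
        for i, row in enumerate(rows[cut:], start=1)
    ]
    return train, test


def to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = list(csv.DictReader(io.StringIO(RAW)))
train_rows, test_rows = split_dataset(rows)
train_csv = to_csv(train_rows)  # includes the Play column
test_csv = to_csv(test_rows)    # Play column removed
```

In practice the rows are often shuffled before splitting so that both parts are representative. The sketch keeps the original order to stay close to the tables above.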

What does the candidate do after the models are trained?

After the models have been trained, candidates are expected to do the following:

Predict the outcome for each row of the test data set
Upload the prediction file to HackerEarth
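From the candidate's side, these two steps might look like the sketch below. It is a deliberately simple baseline, not a recommended model: it predicts the most common Play value seen in training for each Outlook value. The test rows are the remaining five rows of the entire data set. The prediction file layout (an ID column plus a Play column) is an assumption, because the required format depends on the problem statement. All function names are hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict

TRAIN_CSV = """ID,Outlook,Temperature,Humidity,Wind,Play
1,Sunny,Hot,High,False,No
2,Rainy,Mild,High,False,Yes
3,Sunny,Cool,Normal,False,Yes
4,Overcast,Hot,High,False,Yes
5,Rainy,Mild,High,False,Yes
"""

TEST_CSV = """ID,Outlook,Temperature,Humidity,Wind
1,Overcast,Hot,Normal,False
2,Sunny,Mild,Normal,True
3,Sunny,Mild,High,False
4,Overcast,Cool,Normal,True
5,Rainy,Mild,High,True
"""


def train_lookup(train_rows, feature="Outlook", target="Play"):
    """Learn the most common target value for each value of one feature."""
    by_value = defaultdict(Counter)
    overall = Counter()
    for row in train_rows:
        by_value[row[feature]][row[target]] += 1
        overall[row[target]] += 1
    # Fallback for feature values that never appear in the training data.
    default = overall.most_common(1)[0][0]
    table = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    return table, default


def predict(test_rows, table, default, feature="Outlook"):
    """Produce one prediction per test row, keyed by the row's ID."""
    return [
        {"ID": row["ID"], "Play": table.get(row[feature], default)}
        for row in test_rows
    ]


train_rows = list(csv.DictReader(io.StringIO(TRAIN_CSV)))
test_rows = list(csv.DictReader(io.StringIO(TEST_CSV)))

table, default = train_lookup(train_rows)
predictions = predict(test_rows, table, default)

# Build the prediction file contents (assumed layout: ID,Play).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["ID", "Play"], lineterminator="\n")
writer.writeheader()
writer.writerows(predictions)
prediction_csv = buf.getvalue()
```

The text in `prediction_csv` can then be saved to a file and uploaded.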