Do's and don'ts - ML Hackathon

This document details the following for a Machine Learning hackathon with automatic evaluation.

Train.csv: Contains the training data
Test.csv: Contains the test data (without target column)
Sample submission.csv: Format that candidates must follow for their submission
Note: These files (Train.csv, Test.csv, and Sample submission.csv) should be included in a single .ZIP file named dataset.zip

Results.csv: Contains the identifier/Index column along with the truth values for test data
Checker.py: Allows you to evaluate a candidate's submission automatically. It contains some checkpoints and an evaluation metric that generates a score. To learn more about checker files, read this article.
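
As a rough illustration of what a checker of this kind might do, the sketch below runs three checkpoints (matching column names, matching row count, matching identifiers) and then computes a score. The function name evaluate, the column names id and target, and the use of plain accuracy as the metric are all assumptions made for this example; they are not the platform's actual Checker.py interface.

```python
import csv
import io


def read_csv(text):
    """Parse CSV text and return (column names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return reader.fieldnames or [], rows


def evaluate(results_text, submission_text, id_col="id", target_col="target"):
    """Return (score, message). Column names and metric are assumed here."""
    truth_cols, truth = read_csv(results_text)
    sub_cols, sub = read_csv(submission_text)

    # The organizer's truth file itself must be usable.
    if id_col not in truth_cols or target_col not in truth_cols or not truth:
        raise ValueError("results file is empty or lacks the expected columns")

    # Checkpoint 1: column names must match exactly (case-sensitive).
    if sub_cols != truth_cols:
        return 0.0, "column names do not match"

    # Checkpoint 2: the submission must have one row per test record.
    if len(sub) != len(truth):
        return 0.0, "row count does not match"

    # Checkpoint 3: identifiers must be unique and identical to the truth file.
    truth_by_id = {row[id_col]: row[target_col] for row in truth}
    sub_by_id = {row[id_col]: row[target_col] for row in sub}
    if len(sub_by_id) != len(sub) or sub_by_id.keys() != truth_by_id.keys():
        return 0.0, "identifiers do not match"

    # Metric: percentage of predictions that exactly match the truth values.
    correct = sum(sub_by_id[key] == value for key, value in truth_by_id.items())
    return 100.0 * correct / len(truth), "ok"
```

Rows are matched by identifier rather than by position, so a submission whose rows are in a different order than Results.csv is still scored correctly.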

Please refer to this document to understand what an ideal problem statement must contain.

The following checklist highlights the key guidelines to be followed when creating a problem statement and dataset:

The dataset should be as unique as possible.
Datasets with the following limitations must not be used:
1. Datasets that require permission to be used
2. Licensed datasets
The dataset should be anonymized and should not carry any original or confidential company-identifying data.
The data should not be collated from any open-source references. Avoiding open-source data helps prevent plagiarism in submissions and solutions.
The accepted result file formats on the platform are as follows:

Ensure that the following guidelines are followed while creating the problem statement:

The problem statement should be as unique as possible.
The problem statement description should be structured properly.
The problem statement should be simple, clear, and concise. The definition should be easy to understand with respect to the following:
- The objective of the problem
- Tasks
- Target variables

The following sanity checks must be completed before the problem statement and data are shared with HackerEarth:

The identifier column in the train dataset and in the test dataset should contain unique values.
The identifier column in the test dataset and in Results.csv should be the same.
All the components should be present in the dataset.zip file.
The column names in all CSV files should match the column descriptions in the problem statement. The names of the columns are case-sensitive.
The problem statement should contain all the components.
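
Several of the sanity checks above can be automated before the files are shared. The sketch below is one possible way to do that using only the Python standard library. The function name sanity_check, the identifier column name id, and the described_columns argument (the set of column names listed in the problem statement) are hypothetical and should be adapted to the actual dataset.

```python
import csv
import io
import zipfile

# File names taken from the component list in this document.
EXPECTED_FILES = {"Train.csv", "Test.csv", "Sample submission.csv"}


def read_csv(text):
    """Parse CSV text and return (column names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return reader.fieldnames or [], rows


def sanity_check(zip_file, results_text, described_columns, id_col="id"):
    """Return a list of problems found; an empty list means no issue detected.

    zip_file is a path or file-like object for dataset.zip.
    """
    problems = []
    with zipfile.ZipFile(zip_file) as archive:
        names = set(archive.namelist())
        # Check: all components are present in dataset.zip.
        missing = EXPECTED_FILES - names
        if missing:
            problems.append(f"missing from dataset.zip: {sorted(missing)}")
        tables = {
            name: read_csv(archive.read(name).decode("utf-8"))
            for name in EXPECTED_FILES & names
        }

    # Check: identifier values are unique in the train and test datasets.
    for name in ("Train.csv", "Test.csv"):
        if name in tables:
            ids = [row.get(id_col) for row in tables[name][1]]
            if len(ids) != len(set(ids)):
                problems.append(f"{name}: duplicate values in '{id_col}'")

    # Check: column names match the problem statement (case-sensitive).
    for name, (columns, _) in sorted(tables.items()):
        unknown = [col for col in columns if col not in described_columns]
        if unknown:
            problems.append(f"{name}: columns not in problem statement: {unknown}")

    # Check: Test.csv and Results.csv carry the same identifiers.
    if "Test.csv" in tables:
        test_ids = {row.get(id_col) for row in tables["Test.csv"][1]}
        result_ids = {row.get(id_col) for row in read_csv(results_text)[1]}
        if test_ids != result_ids:
            problems.append("Test.csv and Results.csv identifiers differ")

    return problems
```

A call such as sanity_check("dataset.zip", open("Results.csv").read(), {"id", "feature", "target"}) would return an empty list when none of these checks fails. The check that the problem statement contains all of its components still has to be done by hand.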