This document details the following for a Machine Learning hackathon with automatic evaluation.
Mandatory resources
Data files
- Train.csv: Contains the training data
- Test.csv: Contains the test data (without target column)
- Sample submission.csv: Format that candidates must follow for their submission
Note: These files (Train.csv, Test.csv, and Sample submission.csv) should be included in a single .ZIP file named dataset.zip.
- Results.csv: Contains the identifier/index column along with the ground-truth target values for the test data
- Checker.py: Allows you to evaluate a candidate's submission automatically. It contains some checkpoints and an evaluation metric that generates a score. To learn more about checker files, read this article.
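To illustrate what such a checker might look like, here is a minimal sketch in Python. It is not HackerEarth's actual Checker.py. The column names `id` and `target`, the function names, and the choice of accuracy as the evaluation metric are all assumptions made for this example; a real checker would use the metric defined in the problem statement.

```python
import csv


def read_column_map(path, id_col, target_col):
    """Read a CSV file and return {identifier: target value}."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        # Checkpoint: both columns must be present (names are case-sensitive).
        for name in (id_col, target_col):
            if name not in header:
                raise ValueError(f"{path}: missing column {name!r}")
        values = {}
        for row in reader:
            key = row[id_col]
            # Checkpoint: identifiers must be unique within the file.
            if key in values:
                raise ValueError(f"{path}: duplicate identifier {key!r}")
            values[key] = row[target_col]
    return values


def evaluate(results_path, submission_path, id_col="id", target_col="target"):
    """Return the accuracy (0 to 100) of a submission against Results.csv.

    The default column names and the accuracy metric are assumptions.
    """
    truth = read_column_map(results_path, id_col, target_col)
    predicted = read_column_map(submission_path, id_col, target_col)
    # Checkpoint: the submission must cover exactly the test identifiers.
    if set(truth) != set(predicted):
        raise ValueError("submission identifiers do not match Results.csv")
    if not truth:
        raise ValueError("Results.csv contains no rows")
    correct = sum(1 for key in truth if truth[key] == predicted[key])
    return 100.0 * correct / len(truth)
```

Calling `evaluate("Results.csv", "submission.csv")` returns the score when every checkpoint passes and raises `ValueError` with a short message otherwise. Rows are matched by identifier, so the order of rows in the submission does not matter.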
Problem statement
Please refer to this document to understand what an ideal problem statement must contain. It should include the following:
- Description of the problem statement
- Column descriptions
- Evaluation criteria
- Result submission guidelines
Do's and Don’ts
Dataset
The following checklist highlights the key guidelines to be followed when creating a problem statement and dataset:
- The dataset should be as unique as possible.
- Datasets with the following limitations must not be used:
- Datasets that require permission to be used
- Licensed datasets
- The dataset should be anonymized and should not carry any original or confidential company-identifying data.
- The data should not be collated from open-source references. Using open-source data is not recommended because it makes plagiarism in submissions and solutions more likely.
- The accepted result file formats on the platform are as follows:
- .CSV
- .JSON
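For the anonymization guideline in the checklist above, one simple option is to replace original identifiers with opaque surrogate numbers before the data is shared. The sketch below is only one possible approach and is not prescribed by the platform. The function name `surrogate_ids` and the sample identifiers are hypothetical. It covers identifier columns only, so other fields that could reveal a company or person still need a separate review.

```python
import random


def surrogate_ids(original_ids, seed=None):
    """Map each distinct original identifier to an opaque integer.

    The identifiers are shuffled before numbering so that the surrogate
    values do not reveal the original sort order.
    """
    distinct = sorted(set(original_ids))
    rng = random.Random(seed)
    rng.shuffle(distinct)
    return {original: index for index, original in enumerate(distinct, start=1)}


# Hypothetical usage: build one mapping, then apply it to every file.
mapping = surrogate_ids(["ACME-7", "ACME-3", "ACME-7"], seed=0)
anonymized = [mapping[value] for value in ["ACME-7", "ACME-3", "ACME-7"]]
```

The same mapping should be applied to the test data and to Results.csv so that their identifiers still line up, and the mapping itself should stay private.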
Problem statement
Ensure that the following guidelines are followed while creating the problem statement:
- The problem statement should be as unique as possible.
- The problem statement description should be structured properly.
- The problem statement should be simple, clear, and concise. The definition should be easy to understand with respect to the following:
- The objective of the problem
- Tasks
- Target variables
Final checks before the problem is shared with HackerEarth
The following sanity checks must be completed before the problem statement and data are shared with HackerEarth:
- The identifier column in the train dataset and the test dataset should contain unique values.
- The identifier column should be the same in the test dataset and Results.csv.
- All the components should be present in the dataset.zip file.
- The column names in all CSV files should match the column descriptions in the problem statement. The names of the columns are case-sensitive.
- The problem statement should contain all the components.
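Several of the sanity checks above can be automated. The following is a minimal sketch in Python that uses only the standard library. It assumes the three CSV files sit at the root of dataset.zip and are UTF-8 encoded. The function names, the `id_col` parameter, and the `described_columns` set (the column names listed in the problem statement) are hypothetical names introduced for this example.

```python
import csv
import io
import zipfile

REQUIRED_FILES = {"Train.csv", "Test.csv", "Sample submission.csv"}


def read_csv_from_zip(archive, name):
    """Return (header, rows) for one CSV member of an open ZipFile."""
    with archive.open(name) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        reader = csv.DictReader(text)
        rows = list(reader)
        return reader.fieldnames or [], rows


def sanity_check(zip_path, results_path, id_col, described_columns):
    """Return a list of problems; an empty list means every check passed."""
    with zipfile.ZipFile(zip_path) as archive:
        # All components should be present in dataset.zip.
        missing = REQUIRED_FILES - set(archive.namelist())
        if missing:
            return [f"dataset.zip is missing: {sorted(missing)}"]
        files = {
            "Train.csv": read_csv_from_zip(archive, "Train.csv"),
            "Test.csv": read_csv_from_zip(archive, "Test.csv"),
        }
    with open(results_path, newline="", encoding="utf-8") as handle:
        result_rows = list(csv.DictReader(handle))

    problems = []
    for name, (header, rows) in files.items():
        # Column names must match the problem statement (case-sensitive).
        unknown = [col for col in header if col not in described_columns]
        if unknown:
            problems.append(f"{name}: undescribed columns {unknown}")
        if id_col not in header:
            problems.append(f"{name}: missing identifier column {id_col!r}")
            continue
        # Identifier values must be unique within each file.
        ids = [row[id_col] for row in rows]
        if len(ids) != len(set(ids)):
            problems.append(f"{name}: identifier values are not unique")

    # Test.csv and Results.csv must contain the same identifiers.
    test_ids = {row.get(id_col) for row in files["Test.csv"][1]}
    result_ids = {row.get(id_col) for row in result_rows}
    if test_ids != result_ids:
        problems.append("Test.csv and Results.csv identifiers differ")
    return problems
```

The function collects problems into a list instead of stopping at the first failure, so all issues can be fixed in one pass before the files are shared. It does not check whether the problem statement itself is complete; that still needs a manual review.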