Do's and don'ts - ML Hackathon

This document covers the mandatory resources and the do's and don'ts for a Machine Learning hackathon with automatic evaluation.

Mandatory resources

Data files

  • Train.csv: Contains the training data
  • Test.csv: Contains the test data (without target column)
  • Sample submission.csv: Format that candidates must follow for their submission
    Note: These files (Train.csv, Test.csv, and Sample submission.csv) should be included in a single .ZIP file named dataset.zip
  • Results.csv: Contains the identifier/Index column along with the truth values for test data
  • Checker.py: Allows you to evaluate a candidate's submission automatically. It contains some checkpoints and an evaluation metric that generates a score. To learn more about checker files, read this article.
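
As a rough illustration of the checkpoints and the evaluation metric mentioned above, a checker could look something like the sketch below. It is not the platform's actual Checker.py format. The column names `id` and `target`, the function names, and the choice of accuracy (as a percentage) as the metric are all assumptions made for this example.

```python
import csv
import io


def read_rows(text, id_col, target_col):
    """Parse CSV text into {identifier: target}, enforcing basic checkpoints."""
    reader = csv.DictReader(io.StringIO(text))
    fields = reader.fieldnames or []
    # Checkpoint: required columns must exist (names are case-sensitive).
    for col in (id_col, target_col):
        if col not in fields:
            raise ValueError(f"missing column: {col}")
    rows = {}
    for row in reader:
        key = row[id_col]
        # Checkpoint: identifiers must be unique.
        if key in rows:
            raise ValueError(f"duplicate identifier: {key}")
        rows[key] = row[target_col]
    return rows


def score_submission(truth_text, submission_text, id_col="id", target_col="target"):
    """Return accuracy as a percentage (assumed metric, hypothetical column names)."""
    truth = read_rows(truth_text, id_col, target_col)
    submitted = read_rows(submission_text, id_col, target_col)
    # Checkpoint: the submission must cover exactly the test identifiers.
    if set(truth) != set(submitted):
        raise ValueError("submission identifiers do not match the test data")
    if not truth:
        raise ValueError("truth file contains no rows")
    correct = sum(1 for key in truth if truth[key] == submitted[key])
    return 100.0 * correct / len(truth)
```

Here `truth_text` would be the contents of Results.csv and `submission_text` the contents of a candidate's file. Row order does not matter because rows are matched on the identifier column.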

Problem statement

Refer to this document to understand what an ideal problem statement must contain. It must include the following:

  • Description of the problem statement
  • Column descriptions
  • Evaluation criteria
  • Result submission guidelines

Do's and don'ts

Dataset

The following checklist highlights the key guidelines to be followed when creating a problem statement and dataset:

  1. The dataset should be as unique as possible.
  2. The following datasets must not be used:
    1. Datasets that require permission to be used
    2. Licensed datasets
  3. The dataset should be anonymized and should not contain any original or confidential data that identifies a company.
  4. The data should not be collated from any open-source references. To avoid plagiarism in submissions and solutions, using open-source data is not recommended.
  5. The accepted result file formats on the platform are as follows:
    • .CSV
    • .JSON

Problem statement

Ensure that the following guidelines are followed while creating the problem statement:

  1. The problem statement should be as unique as possible.
  2. The problem statement description should be structured properly.
  3. The problem statement should be simple, clear, and concise. The definition should be easy to understand with respect to the following:
    • The objective of the problem
    • Tasks
    • Target variables

Final checks before the problem is shared with HackerEarth

The following sanity checks must be completed before the problem statement and data are shared with HackerEarth:

  • The identifier column in the train dataset and in the test dataset should contain unique values.
  • The identifier column should be the same in the test dataset and Results.csv.
  • All the components should be present in the dataset.zip file.
  • The column names in all CSV files should match the column descriptions in the problem statement. The names of the columns are case-sensitive.
  • The problem statement should contain all the components.
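
Some of the checks above can be automated before the files are shared. The sketch below is one possible approach, not an official tool. The function names are hypothetical, the expected archive contents are taken from the note about dataset.zip, and "the same" identifier column is interpreted here as the same set of identifier values in the test data and Results.csv.

```python
import csv
import io
import zipfile

# Files the note on dataset.zip says the archive should contain.
EXPECTED_FILES = {"Train.csv", "Test.csv", "Sample submission.csv"}


def read_csv(text):
    """Return (column names, rows) for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def missing_from_zip(zip_file, expected=EXPECTED_FILES):
    """List expected files that are absent from the archive (path or file object)."""
    with zipfile.ZipFile(zip_file) as archive:
        return sorted(set(expected) - set(archive.namelist()))


def sanity_problems(train_text, test_text, results_text, id_col, described_columns):
    """Collect human-readable problems; an empty list means all checks passed."""
    files = {
        "Train.csv": read_csv(train_text),
        "Test.csv": read_csv(test_text),
        "Results.csv": read_csv(results_text),
    }
    problems = []

    # Identifier values must be unique in the train and test data.
    for name in ("Train.csv", "Test.csv"):
        cols, rows = files[name]
        if id_col not in cols:
            problems.append(f"{name}: identifier column {id_col!r} not found")
            continue
        ids = [row[id_col] for row in rows]
        if len(ids) != len(set(ids)):
            problems.append(f"{name}: duplicate values in {id_col!r}")

    # Test data and Results.csv must share the same identifiers.
    test_ids = {row.get(id_col) for row in files["Test.csv"][1]}
    result_ids = {row.get(id_col) for row in files["Results.csv"][1]}
    if test_ids != result_ids:
        problems.append("Test.csv and Results.csv identifiers differ")

    # Column names must match the problem statement exactly (case-sensitive).
    for name, (cols, _rows) in files.items():
        unknown = [col for col in cols if col not in described_columns]
        if unknown:
            problems.append(f"{name}: columns not in the problem statement: {unknown}")
    return problems
```

Calling `sanity_problems` with the contents of the three files, the identifier column name, and the set of column names from the problem statement returns a list of issues to fix. `missing_from_zip("dataset.zip")` reports any file that is absent from the archive.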