Ethical data splitting is essential to the validity of any data-driven solution. If the data used to build or evaluate a solution are biased, the results will not accurately reflect how well the solution solves the problem. To split data ethically, the overall variance of the data must be fairly represented in both the training and testing sets. This requires identifying the outliers in the data so that they can be accounted for when the data are split. Finding the principal components of the data using the L2-norm has been shown to be an effective way to identify outliers and to build a dataset that is robust to them. The L1-norm has been shown to be more resistant to outliers than the L2-norm, so using it in place of the L2-norm will make the resulting dataset more resistant to outliers. Therefore, utilizing L1-norm principal components when determining ethical data splits will result in more robust datasets.
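As a rough illustration of the idea, the following minimal sketch estimates one L1-norm principal component, flags outliers by their distance from it, and then splits the data so that flagged and unflagged points are both represented in the training and testing sets. It is not the method evaluated in this work: the function names `l1_principal_component` and `outlier_aware_split`, the sign-based fixed-point heuristic, the median-plus-MAD threshold, and the default parameters are all assumptions chosen for illustration.

```python
import numpy as np


def l1_principal_component(X, n_iter=100, seed=0):
    """Heuristic search for a unit vector w that (locally) maximizes
    sum_i |x_i . w|, the L1-norm projection objective.

    X is assumed to be centered, with shape (n_samples, n_features).
    This sign-based fixed-point iteration is an illustrative choice and
    only guarantees a local maximum.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        signs = np.sign(X @ w)
        signs[signs == 0] = 1.0          # break ties consistently
        w_new = X.T @ signs
        norm = np.linalg.norm(w_new)
        if norm == 0:
            break
        w_new /= norm
        converged = np.allclose(w_new, w)
        w = w_new
        if converged:
            break
    return w


def outlier_aware_split(X, test_fraction=0.25, k=3.0, seed=0):
    """Split row indices of X so outliers and inliers both appear in
    the training and testing sets in roughly the same proportion.

    Outliers are flagged (an assumed rule) when their distance from the
    L1 principal axis exceeds median + k * MAD of those distances.
    """
    Xc = X - np.median(X, axis=0)        # robust centering
    w = l1_principal_component(Xc, seed=seed)
    residual = np.linalg.norm(Xc - np.outer(Xc @ w, w), axis=1)
    med = np.median(residual)
    mad = np.median(np.abs(residual - med))
    is_outlier = residual > med + k * mad

    rng = np.random.default_rng(seed)
    test_mask = np.zeros(len(X), dtype=bool)
    # Sample the test set separately from each group (stratified split).
    for group in (np.flatnonzero(is_outlier), np.flatnonzero(~is_outlier)):
        rng.shuffle(group)
        n_test = int(round(test_fraction * len(group)))
        test_mask[group[:n_test]] = True
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask), is_outlier
```

Calling `train_idx, test_idx, flagged = outlier_aware_split(X)` on an `(n_samples, n_features)` array returns disjoint index arrays plus the outlier mask. Stratifying on the mask is one simple way to keep the outliers from landing entirely in one set; other allocation rules are equally plausible.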