Artificial Intelligence (AI) is all around us. The last few years have been especially remarkable as its applications have spread to how we get news, how we interact with our phones and homes, and how we interact with each other.
The reasons for these advances are diverse, but broadly speaking, research progress in AI comes from better and faster hardware, more data, and improved algorithms.
Model-centric approach:
Until now, the focus in developing AI applications has been on training and optimizing machine learning models for a given data set. Data scientists concentrated on adapting, tweaking and changing the model code to make it better. Once a model was trained and accepted, it was evaluated on a separate, previously unseen set of data to measure its performance. The quality of the data did have an impact, but the focus was on refining the model itself rather than the data at hand.
Data-centric approach:
Data is a key aspect of this journey, which is why Andrew Ng recently highlighted the importance of data-driven AI. Andrew Ng is a key figure in AI, and he recently started a data-centric competition to highlight the importance of also adapting the data and ensuring its quality in order to improve model performance.
According to datacentricai.org, “data-centric AI is the discipline of systematically engineering the data used to build an AI system”1. Since data is the power behind AI, it is important to ensure that the data used for the model is in good shape, and also that we train the model with enough data to learn from. Very often, data scientists take a fixed portion of the data (e.g. 2 years of data) to train the model. They train the model and leave it as is until they see a drop in performance. But what if we used more data than usual and explored the possibilities that the additional data provides? Would it make a difference in performance?
We say yes, and we agree with this new twist on what to look for when trying to make AI models better. So what can you do to move toward a data-centric approach?
Following Andrew’s advice on the DeepLearningAI YouTube channel2, we can use 5 tips to get to a more data-centric approach:
1- Making the labels y consistent
When a data point could be labeled in more than one way, define in advance exactly which requirements it must fulfill to belong to one class or the other. Make sure this definition is applied consistently across all the data you have.
2- Using multiple labelers:
Looking at the examples he gave, this is a very useful tip that might be underestimated in practice. Each person who labels the dataset has a different way of defining the label. Before labeling starts, make sure everyone agrees on how the data should be labeled.
3- Repeatedly evaluating and clarifying the instructions
If there is ambiguity in your labels, record it and document the decision that was made to resolve it.
4- Get rid of bad examples
More data is not always better. If some of your data is very unclear, or difficult to recognize even for you as a human, it is better to remove it than to train on bad examples.
5- Use error analysis to find the data set that you want to improve
Rather than looking at all the data, identify the subset you are not satisfied with and treat it as a separate dataset to relabel and clean. This improves the overall quality of your dataset.
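Two of the tips above can be tried out with very little code. The following is a minimal sketch, not something taken from the video: it assumes that label consistency between two labelers is measured with Cohen's kappa (a common agreement statistic that we chose for illustration) and that every prediction carries a tag describing which slice of the data it belongs to. The function names `cohen_kappa` and `error_rate_by_slice`, the slice names and the sample labels are all made up for this example.

```python
from collections import Counter, defaultdict


def cohen_kappa(labels_a, labels_b):
    """Agreement between two labelers, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of examples on which both labelers chose the same class.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if both labeled at random with their own class rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one and the same class throughout
    return (observed - expected) / (1 - expected)


def error_rate_by_slice(records):
    """records: iterable of (slice_name, true_label, predicted_label)."""
    totals, errors = defaultdict(int), defaultdict(int)
    for slice_name, truth, predicted in records:
        totals[slice_name] += 1
        errors[slice_name] += truth != predicted  # True counts as 1
    return {name: errors[name] / totals[name] for name in totals}


# Hypothetical labels from two people for the same four images.
labeler_a = ["cat", "cat", "dog", "dog"]
labeler_b = ["cat", "dog", "dog", "dog"]
print("kappa:", cohen_kappa(labeler_a, labeler_b))

# Hypothetical model predictions, tagged with an image-quality slice.
records = [
    ("blurry", "cat", "dog"),
    ("blurry", "dog", "dog"),
    ("sharp", "cat", "cat"),
    ("sharp", "dog", "dog"),
]
rates = error_rate_by_slice(records)
worst_slice = max(rates, key=rates.get)
print("error rates:", rates, "-> relabel and clean first:", worst_slice)
```

A kappa value well below 1 suggests that the labeling instructions need to be clarified before more data is labeled, and the slice with the highest error rate is a natural candidate for the separate dataset to relabel and clean.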
Training machine learning models is a time-intensive process. Although much of the effort goes into optimizing the model code, we should still focus on getting the right data. Investing additional time, ensuring quality standards and aligning with the team are therefore key to success.
Text by Sanja Jovanovic
Photo by Ales Nesetril on Unsplash
1 See: https://datacentricai.org/
2 See: https://www.youtube.com/watch?v=Yqj7Kyjznh4