One thing that keeps surprising me is how much effort companies spend on processing, cleaning, converting and preparing data. At the companies I work with, the data science teams easily spend 90-95 percent of their time just preparing data for use in machine learning and deep learning deployments. Several challenges cause this, of which data silos, lack of labeled data, unbalanced training sets, training/serving skew and unstable data pipelines are the most common.
Data silos arise from the typical situation in which every team and department conducts its own data collection for its own purposes. As many companies use DevOps, continuous deployment or some other form of frequent, periodic delivery, the semantics of the data tend to change with every new version of the software. That makes it difficult to use data collected over longer periods for training: data scientists and engineers have to manually review the data from every period and convert it into a larger, homogeneous set that can be used for training.
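To make the conversion step concrete, here is a minimal sketch of what harmonizing data from two software versions might look like. Everything in it is an assumption for illustration: the `schema_version` tag, the field names, and the idea that version 1 logged a duration in seconds while version 2 logged it in milliseconds. Real data rarely carries such a clean version tag, which is a large part of why the review ends up being manual.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]

# Hypothetical scenario: the meaning of the duration field changed between releases.
# v1 logged seconds under "duration"; v2 logged milliseconds under "duration_ms".

def from_v1(rec: Record) -> Record:
    # v1 already uses seconds, so only the field name changes.
    return {"device": rec["device"], "duration_s": float(rec["duration"])}

def from_v2(rec: Record) -> Record:
    # v2 uses milliseconds, so convert to seconds.
    return {"device": rec["device"], "duration_s": float(rec["duration_ms"]) / 1000.0}

# One converter per known version (assumed tag values).
CONVERTERS: Dict[str, Callable[[Record], Record]] = {
    "v1": from_v1,
    "v2": from_v2,
}

def harmonize(records: List[Record]) -> List[Record]:
    """Convert records from mixed versions into one common schema."""
    out: List[Record] = []
    for rec in records:
        version = rec.get("schema_version")
        if version not in CONVERTERS:
            # Fail loudly rather than silently mixing incompatible semantics.
            raise ValueError(f"no converter for schema version {version!r}")
        out.append(CONVERTERS[version](rec))
    return out

raw = [
    {"schema_version": "v1", "device": "a", "duration": 2},
    {"schema_version": "v2", "device": "b", "duration_ms": 1500},
]
training_set = harmonize(raw)
```

The design choice worth noting is that an unknown version raises an error instead of passing through. A record whose semantics nobody has reviewed is more dangerous in a training set than a missing record.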