It may come as no surprise that the amount of data on the internet keeps swelling, so much so that it has become difficult to keep track of. In 2005 we were dealing with barely 0.1 zettabytes of data; that number is now just above 20 zettabytes, and it is estimated to reach a staggering 47 zettabytes by 2020.

Apart from its sheer quantity, the problem is that most of this data is unstructured. And few things are more harmful than providing AI with incomplete or inaccurate data.

It seems that only about 10% of this data is structured, while the rest is a great jumble of information that isn’t tagged and cannot be used constructively by machines. To make the distinction concrete: an email does not qualify as structured data, while something like a spreadsheet is considered tagged and can readily be scanned by machines.
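To illustrate the email versus spreadsheet distinction, here is a minimal Python sketch. The sample data, the column name `amount_usd`, and both helper functions are made up for this illustration. It only shows why labeled columns are easy for a program to use while free text is not; it is not a real text-mining approach.

```python
import csv
import io
import re

# Structured: a spreadsheet-like table exported as CSV (made-up sample data).
SPREADSHEET = "name,amount_usd\nAlice,120\nBob,80\n"

# Unstructured: the same facts buried in a free-form email (made-up sample).
EMAIL = (
    "Hi team, Alice spent about 120 dollars last week. "
    "Bob says his came to 80, due by the 15th."
)


def total_from_spreadsheet(text):
    """Sum the amount column; the header tells the program where to look."""
    rows = csv.DictReader(io.StringIO(text))
    return sum(int(row["amount_usd"]) for row in rows)


def numbers_in_email(text):
    """Pull every integer out of free text; nothing says what each one means."""
    return [int(n) for n in re.findall(r"\d+", text)]


structured_total = total_from_spreadsheet(SPREADSHEET)
email_numbers = numbers_in_email(EMAIL)

print(structured_total)
print(email_numbers)
```

The spreadsheet yields a total directly because every value sits under a named column. The email yields a list of numbers that mixes amounts with a date, and without tags the program cannot tell which is which.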

This may not seem that problematic, but we need clean and organized data if we expect AI to improve our lives in areas such as healthcare, driverless cars, connected homes and so on. The irony is that we’ve become really good at creating content and data, but we haven’t yet figured out how to put it to work reliably for our needs.

It's only natural that data science is one of the fields that has gained a lot of ground over the past few years, with more and more data scientists dedicating their careers to sorting out the mess. However, a recent survey shows that, contrary to popular opinion, data scientists spend far less time building algorithms and mining data for patterns than they do on so-called digital janitorial work: cleaning and organizing data.

