There's no such thing as raw data

18.01.22

The promise of deep learning was to eliminate feature engineering pipelines. That's probably a myth.

The story goes like this: deep learning works on raw data, while classical ML needs engineered features. Deep learning should therefore remove the need for hand-curated, fragile feature pipelines. I'm fairly sure this is not the case for structured data, but according to an article by Pete Warden, it isn't even true for the archetypal application of deep learning: vision.

Even in vision, a lot of processing is needed to go from RAW camera data to the evenly spaced RGB pixel data that is the input to most models. That hidden engineering exists to make the data understandable to the human visual system. In other situations, where we might be dealing with mixed types of data from multiple sources, the need for feature engineering and data cleansing is even greater.
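To make the camera example concrete, here is a toy sketch of just one step that typically sits between a sensor and an RGB image: demosaicing. It assumes an RGGB Bayer layout, where each photosite records a single color, and it collapses every 2x2 block into one RGB pixel. The function name, the layout, and the sample values are all illustrative assumptions, not taken from Pete Warden's article, and real camera pipelines are considerably more involved than this.

```python
def demosaic_rggb(mosaic):
    """Collapse each 2x2 RGGB block of a Bayer mosaic into one RGB pixel.

    Assumed layout of every block (hypothetical, for illustration only):
        R G
        G B
    The output has half the width and half the height of the input.
    """
    height, width = len(mosaic), len(mosaic[0])
    if height % 2 or width % 2:
        raise ValueError("mosaic dimensions must be even")

    rgb = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            red = mosaic[y][x]
            # Two green photosites per block: average them.
            green = (mosaic[y][x + 1] + mosaic[y + 1][x]) / 2
            blue = mosaic[y + 1][x + 1]
            row.append((red, green, blue))
        rgb.append(row)
    return rgb


# Made-up 4x4 sensor readout: one intensity per photosite, no color yet.
raw = [
    [10, 20, 30, 40],
    [22, 50, 44, 60],
    [70, 80, 90, 100],
    [84, 110, 104, 120],
]

image = demosaic_rggb(raw)
print(image[0][0])  # (10, 21.0, 50)
```

Even this simplified step bakes in decisions (which layout, how to combine the two greens, what resolution to output) before a model ever sees a "raw" pixel.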

Don't throw away those data pipelines just yet!
