Inside this 1.8 TB monster are over 100,000 frames of imagery and associated data, crowdsourced from all over the USA using dash-mounted cameras. That might not sound exciting on the surface, and indeed many of the 40-second clips are not exciting to watch. What makes this collection really exciting is its sheer scale: never before has such a large, diverse range of data been made available to the public for training artificial intelligence systems.

We talked to our friend Marcus Vernon at Arion AI about how executives approach Artificial Intelligence (AI) projects. Arion AI is a leading data science and machine learning consultancy to the aerospace, finance and retail sectors.

“There are a lot of misconceptions about AI and machine learning: what they are and how they work. The hype in the media gives a confusing picture. One question that I'm asked a lot is: what's the difference between AI and machine learning?

Simply put, machine learning provides the ability to analyse existing complex data using a computer to train a mathematical model for specific applications. The model can then predict outputs based on similar new inputs, which can be anything from images to sound or numeric data.

The definition of AI changes with the advancement of technology. In the 1970s a pocket calculator could have been called 'Artificial Intelligence', but you wouldn't say that today. Similarly, even though machine learning is astonishingly good, it still needs a human to tell it what problem to solve and the boundaries that define it. In the future, new technologies will push its ability even further, but we aren't anywhere close to the Hollywood definition yet!”

Marcus Vernon

There are many challenges in the AI industry, but the ones we focused on in this activity were:

  • Storing, moving and getting fast access to a large enough dataset is hard when dealing with video data.
  • Recording the results of different AI models and their variations is time-consuming; doing this faster makes it much easier to refine their accuracy.

Training is one of the keys to the success of an AI, so the way training data is delivered is vitally important: you cannot simply point an AI algorithm at a folder of files and tell it to go and learn. The quality and diversity of the data matter just as much, so delivering this amazing dataset within the framework of Col8 should provide the step up in AI training that the data science industry is looking for.

We want to give our data science partners this amazing tool and dataset to work from, but before we could do that we had to extract and process all the raw data from the DeepDrive project. Over the course of 28 hours we automatically downloaded, processed and organized all the videos against their datasets into individual records on Col8. Without even bringing AI into the equation, we went from a flat folder structure with tens of thousands of randomly named files to this beautiful image of San Francisco and New York formed purely of crowdsourced video. Scroll around the interactive map and instantly access any video using its location or date filters.
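To make the idea concrete, here is a minimal sketch of how randomly named video files could be paired with their metadata and then filtered by location and date. It is an illustration only, not the Col8 pipeline: the `Record` fields, the `build_records` and `filter_records` helpers, the file names and the sample coordinates and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import PurePath


@dataclass(frozen=True)
class Record:
    """One video paired with its metadata (hypothetical schema, not Col8's)."""
    video: str
    latitude: float
    longitude: float
    recorded_on: date


def build_records(video_paths, metadata_by_stem):
    """Pair each video with the metadata entry that shares its file stem."""
    records = []
    for path in video_paths:
        meta = metadata_by_stem.get(PurePath(path).stem)
        if meta is None:
            continue  # skip videos that have no matching metadata
        records.append(Record(path, meta["lat"], meta["lon"], meta["date"]))
    return records


def filter_records(records, south, west, north, east, start, end):
    """Keep records inside a bounding box and an inclusive date range."""
    return [
        r for r in records
        if south <= r.latitude <= north
        and west <= r.longitude <= east
        and start <= r.recorded_on <= end
    ]


# Made-up sample data standing in for a flat folder of randomly named files.
videos = ["a81f.mov", "c07d.mov", "ffe2.mov"]
metadata = {
    "a81f": {"lat": 37.77, "lon": -122.42, "date": date(2017, 5, 1)},   # San Francisco
    "c07d": {"lat": 40.71, "lon": -74.01, "date": date(2017, 6, 12)},   # New York
}

records = build_records(videos, metadata)
san_francisco = filter_records(
    records, 37.0, -123.0, 38.5, -121.5, date(2017, 1, 1), date(2017, 12, 31)
)
```

Once every video is a record with a position and a date, a map view or a date filter becomes a simple query over those fields rather than a search through file names.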

Dr Lukasz Piwek, our long-term advisor to Col8 and a lecturer and researcher in Data Science at the University of Bath, recently commented on this piece of work:

“Data sets obtained from social networks like Twitter provide a great source of textual and meta-data that we can analyse efficiently. However, being able to bring such “static” data together with image and video has always been incredibly difficult. A lot of AI developers focus on specific areas, like image recognition or sentiment analysis, where it is difficult to bring together capabilities of efficient multimedia and metadata integration. I really see Col8 as having the potential to bridge these multidisciplinary areas of data science and help us unlock new levels of insights.”

Lukasz Piwek

You may have come across the term Application Programming Interface (API). Put simply, an API lets computers talk to each other and exchange data in a common format. Col8 is built on top of APIs at its core, and these are what our data science partners will use to move data in and out of the system to train and run their AI models.
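As a rough illustration of what "a common format" means in practice, the sketch below turns a record into JSON text on one side and reads it back on the other. JSON is an assumption chosen for the example; the actual format and endpoints of the Col8 API are not described here, and the `encode_record` and `decode_record` helpers and the record fields are hypothetical.

```python
import json


def encode_record(record: dict) -> str:
    """Serialise a record to JSON text, as a sender would before transmitting it."""
    return json.dumps(record, sort_keys=True)


def decode_record(payload: str) -> dict:
    """Parse JSON text back into a dictionary, as a receiver would."""
    return json.loads(payload)


# Hypothetical record; the field names are illustrative only.
outgoing = {"video": "a81f.mov", "latitude": 37.77, "longitude": -122.42}
payload = encode_record(outgoing)
incoming = decode_record(payload)
```

Because both sides agree on the format, the receiving system does not need to know anything about how the sender stores its data internally.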

In part two of this adventure we will report on how this has gone with researchers at some of the world's most advanced AI institutions.