Your company doesn't need a Data Lake to start a Machine Learning project

TL;DR:

  • You don't need a Data Lake to start Machine Learning projects
  • Structured and reliable data is usually enough
  • It is more important to start building a simple model with the available data than to start with a complex data collection system.

This text was prepared and written by a group of Machine Learning professionals with experience across several clients, each at a different level of maturity in terms of data and information structures within the organization.

Unsurprisingly, the volume of data generated per minute today is immense. With most company systems now computerized, every interaction is stored somehow, whether between the customer and the company, between employees, or with suppliers. At tech giants, the numbers are staggering. In August 2019, according to Forbes [5], every minute:

  • 694,444 hours of video are watched on Netflix
  • 1,389 reservations are made on Airbnb
  • 4,500,000 videos are watched on YouTube
  • 512,200 tweets are sent on Twitter
  • 277,777 stories are posted on Instagram
  • 9,772 Uber rides are requested

Of course, not all companies have as many users or as much traffic as the ones mentioned above, but that doesn't mean they don't struggle with large volumes of data being created and processed at every moment. According to The Economist [6], data is now the world's most valuable resource, surpassing even oil. This is due to advances in Machine Learning and Data Science, which have given us techniques to extract valuable insights from this data. So how do we handle so many bytes without letting anything slip through the cracks?

First, storage becomes a nightmare. On Twitter, for example, 8 terabytes are stored per day [2]. A conventional personal computer has around 1 terabyte of storage; in other words, it would take 8 computers every day just to store the stream of tweets. Companies have to be prepared to store their own data.

Second, you need to pay attention to history. Machine learning models are rarely useful without a history long enough to be analyzed. Companies that have been operating in the market for some time and were wise enough to store their data surely already have enough material to start a machine learning project, but where is this data? Most of the time, it is in Excel spreadsheets (or .csv files), on a company server or on an analyst's computer. At other times, it sits in transactional application databases, just waiting for a query to be run to generate a monthly report.
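As a minimal, hedged sketch of how that scattered history can already be put to work, the snippet below loads a spreadsheet export and queries a transactional database with pandas. The file name, database, table and columns are hypothetical, and a local SQLite file stands in for the real system:

```python
import sqlite3

import pandas as pd

# Data sitting in an analyst's spreadsheet export (hypothetical file)
sales_history = pd.read_csv("monthly_sales_2019.csv", parse_dates=["order_date"])

# Data sitting in a transactional database, read with a single query
# (a local SQLite file stands in for the real system here)
conn = sqlite3.connect("erp.db")
invoices = pd.read_sql_query(
    "SELECT customer_id, invoice_date, total_amount FROM invoices",
    conn,
)
conn.close()

# Enough to start exploring, no Data Lake required
print(sales_history.describe())
print(invoices.head())
```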

In companies that are more mature in terms of data storage and processing, it is common to find Data Warehouses. These are usually built through a set of processes that consolidate information selected by specialists and business analysts, which in turn feeds dashboards and Business Intelligence systems. Such companies are already aware of the value of the information they produce and are able to obtain insights that support data-driven decisions, even if they don't yet have applications that use machine learning.

It is against this backdrop that the idea of the Data Lake emerges: a place where all data is stored, so that when the next machine learning project starts, all the necessary information will already be available. But is the Data Lake really a prerequisite for machine learning projects?

Data Lake vs Data Warehouse

Some companies already have Data Warehouses in place, which in practice is very different from a Data Lake. First, let's understand what each one is.

Data Warehouse

A Data Warehouse is usually a subject-oriented, structured database [4] (e.g. customer invoices) that standardizes data from the company's various systems and is updated periodically (e.g. once a day) through ETL (Extract, Transform and Load) processes, which extract data from the transactional databases, apply the necessary transformations and load the results into the Data Warehouse. This database aims to support business analytics without compromising the efficiency of the transactional databases.

Example of a Data Warehouse data model [7]
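To make the ETL idea concrete, here is a minimal, hypothetical sketch of the three steps described above: extract rows from a transactional database, transform them into the grain analysts need, and load the result into a warehouse table. Database files, table names and columns are illustrative, not taken from any real system:

```python
import sqlite3

import pandas as pd

# Extract: pull raw rows from the transactional system
source = sqlite3.connect("erp.db")
orders = pd.read_sql_query(
    "SELECT customer_id, order_date, amount FROM orders", source
)
source.close()

# Transform: consolidate to a subject-oriented grain (monthly revenue per customer)
orders["order_month"] = (
    pd.to_datetime(orders["order_date"]).dt.to_period("M").astype(str)
)
monthly = (
    orders.groupby(["customer_id", "order_month"], as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "monthly_revenue"})
)

# Load: append the consolidated rows into the Data Warehouse
warehouse = sqlite3.connect("dw.db")
monthly.to_sql("fact_monthly_revenue", warehouse, if_exists="append", index=False)
warehouse.close()
```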

Data Lake

The Data Lake, on the other hand, is a centralized repository that stores both structured data (such as information from database tables) and unstructured data (such as documents and images), without pre-processing and with metadata that allows the information to be browsed in the future.

Example of a Data Lake structure
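The sketch below illustrates this idea in a simplified way: raw files land in a central repository untouched, partitioned by source and ingestion date, with a metadata sidecar so they can be found later. The folder layout and metadata fields are assumptions for illustration, not a standard:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

LAKE_ROOT = Path("datalake/raw")

def store_raw(source: str, payload: bytes, filename: str) -> Path:
    """Store a raw file as-is, partitioned by source and ingestion date."""
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target_dir = LAKE_ROOT / source / today
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / filename
    target.write_bytes(payload)  # no pre-processing: the file is kept untouched

    # Sidecar metadata so the information can be browsed in the future
    metadata = {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(payload),
        "original_name": filename,
    }
    (target_dir / (filename + ".meta.json")).write_text(json.dumps(metadata))
    return target

# Structured and unstructured data go in side by side, unchanged
store_raw("crm", b"id,name\n1,Alice\n", "customers.csv")
store_raw("contracts", b"%PDF-1.4 ...", "contract_0042.pdf")
```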

What are the main differences?

Although both were created to allow information to be analyzed without impacting the company's day-to-day processes, there are crucial differences between a Data Lake and a Data Warehouse, as shown in the comparison table below [1] [8]:

But after all, is the Data Lake a prerequisite for Machine Learning?

Building a Data Lake is not a simple task. It is necessary to know all the processes from which you want to capture data, which types of databases are involved, which tools are capable of monitoring the evolution of these databases, and where to send the collected data. In addition, it is necessary to specify which data will be collected, since a large volume of data may require infrastructure improvements, such as buying more disks, increasing network bandwidth, adding redundancy, and so on.

Usually, when we want to apply machine learning techniques to improve some process, we already know which data is relevant and how it is mapped. This data can be used to build an initial model that, even without using all of the information, is predictive at some useful level.

It is common practice in machine learning to first build a simple model, with only some of the information available, and then iterate to improve its predictive ability. Building a Data Lake can take longer than developing a predictive model.

As an example, let's assume that a company wants to forecast its demand in the coming months. A professional is hired to develop the machine learning model that will make the prediction. This professional has two choices:

  1. Build a Data Lake to capture all sales-related data
  2. Access sales databases and work with information already available

In the first option, the professional will have more information available to build the model, but will spend a good part of the project collecting this information and waiting for it to form a meaningful history, and only then will model development begin.

In the second option, the professional will only have the analysts' view of the data and will need to be careful not to impact the transactional databases, but can start predictive model development immediately. This means that after a short time the professional will have more mastery over the problem and will understand which factors are most relevant and predictive. Furthermore, it is possible that the information used in this first attempt is already enough to build a good predictive model!
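A hedged sketch of this second option might look like the following: query the sales data that already exists and fit a simple first baseline, to be improved in later iterations. The database, table, and lag features are hypothetical, and any simple regressor would do as a starting point:

```python
import sqlite3

import pandas as pd
from sklearn.linear_model import LinearRegression

# Work with the information already available in the sales database
conn = sqlite3.connect("sales.db")
monthly = pd.read_sql_query(
    "SELECT month, units_sold FROM monthly_sales ORDER BY month", conn
)
conn.close()

# Simple lag features: last month and the same month one year earlier
monthly["lag_1"] = monthly["units_sold"].shift(1)
monthly["lag_12"] = monthly["units_sold"].shift(12)
train = monthly.dropna()

# First, deliberately simple model; iterate on features and algorithms later
model = LinearRegression()
model.fit(train[["lag_1", "lag_12"]], train["units_sold"])

# Forecast the next month from the most recent observations
next_month = pd.DataFrame({
    "lag_1": [monthly["units_sold"].iloc[-1]],
    "lag_12": [monthly["units_sold"].iloc[-12]],
})
print("Forecast for next month:", model.predict(next_month)[0])
```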

After developing and deploying the predictive model, the professional may then want to build the Data Lake to feed and improve the model in the future. She now has enough knowledge of the process and its variables to know which data to capture and store!

To start a machine learning project, a company needs:

  • A relevant history of data about the process. The required length of this history varies depending on the purpose of the project.
  • The information must be mapped, that is, one must know in which databases it is stored and under what conditions it is generated.
  • Professionals in the field must have access to that information. In large companies, the scenario where only area administrators know and have access to databases is very common.
  • The information needs to be consistent. It is useless if one system uses the customer's CPF (taxpayer ID) as an identifier and another uses the RG (identity card number), as it will hardly be possible to match the same customer's data across both systems. When we do data engineering, we are constantly cross-referencing information from two or more sources.
  • The data must be of good quality and reliable. If the systems allow errors to be recorded, it becomes difficult to distinguish real information from corrupted information.
  • A Data Lake or Data Warehouse accelerates the process and helps produce better-quality models, but is not a requirement in itself.

After all, is the Data Lake a prerequisite for machine learning? No. It is perfectly possible to start building a predictive model without having a Data Lake up and running, although its existence certainly helps and speeds up development. In terms of priority, however, it is more valuable to have predictive models deployed than to have a Data Lake.

Thanks to the Kunumi team responsible for writing this article (in alphabetical order):

Alex Barros, Eduardo Nigri, Guilherme Andrade, Lucas Miranda, Lucas Parreiras, Luis Sepulvene, Thays Silva, Tiago Alves

About Kunumi:

Founded in 2016, Kunumi is the youngest company in a successful group with more than 20 years of experience developing machine learning. It carries out multidisciplinary projects and research in the health, art, education, chemical, retail and financial sectors, among others. It applies its proprietary Evolve methodology at partner companies to maximize value generation and mitigate the risks inherent in artificial intelligence projects.

References:

  1. https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
  2. https://www.slideshare.net/raffikrikorian/twitter-by-the-numbers/13-Where_do_they_go_Followed
  3. https://www.deeplearning.ai/machine-learning-yearning/
  4. https://www.devmedia.com.br/data-warehouse/12609
  5. https://www.forbes.com/sites/nicolemartin1/2019/08/07/how-much-data-is-collected-every-minute-of-the-day/#16ff3e883d66
  6. https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data
  7. https://databricks.com/glossary/data-warehouse
  8. https://www.talend.com/resources/data-lake-vs-data-warehouse/
