Feature store meaning and benefits

Feature store: definition and advantages in AI

A feature store manages the storage, updating and sharing of the features used by machine learning pipelines and models. It is an essential component of an AI factory.

What is the feature store?

The notion of a feature store was first introduced by Uber in a post published in 2017. The ride-hailing company uses the term for the repository in which it stores the features of its machine learning models. The challenge? Making it easier to share these features from project to project. Uber says it faces numerous modeling problems that involve similar or identical attributes. Hence the idea of pooling them in a central database.

In the context of an AI factory, the feature store is central. It makes it possible to reuse features that have already been created in new developments, which saves a great deal of feature engineering time. In healthcare, for example, features can be an individual's blood type, growth curve or weight. The feature store centralizes these features and makes them easy to share with other machine learning models.

Example of a feature store

Features are the pieces of information fed to a machine learning model. For a recommendation AI on a music streaming platform, they include, for example, the songs already listened to, their playback time and their audience ranking.

With a feature store, the streaming platform can typically reuse the audience ranking computed for recommendation in other models, for advertising targeting for example.

What is the advantage of a feature store?

The feature store consolidates machine learning workflows around a single pipeline for training, testing and validating models, as well as running them in production. It provides a single source of truth, with a single data transformation method per feature. Unifying transformation methods in this way makes them easier to monitor and validate, and simplifies the tracking of biases that may arise from the features or the data. The repository also stores feature metadata and history. This makes it possible to keep track of the comments data scientists make about the influence of features on a given model, and thus to better identify the types of problems each feature can help solve.
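To make the "one transformation per feature" idea concrete, here is a minimal sketch in plain Python. It is not the API of any real feature store: `FeatureDefinition`, `FeatureRegistry` and the feature names are hypothetical, chosen to echo the music streaming example above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FeatureDefinition:
    """One feature: a name, a single transformation, and free-form metadata."""
    name: str
    transform: Callable[[dict], float]
    description: str = ""
    notes: List[str] = field(default_factory=list)  # data scientists' comments


class FeatureRegistry:
    """Hypothetical in-memory registry holding one definition per feature name."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureDefinition] = {}

    def register(self, definition: FeatureDefinition) -> None:
        # Refuse a second definition so each feature keeps one transformation.
        if definition.name in self._features:
            raise ValueError(f"feature {definition.name!r} is already defined")
        self._features[definition.name] = definition

    def compute(self, names: List[str], raw: dict) -> Dict[str, float]:
        # Every model goes through the same registered transformation.
        return {name: self._features[name].transform(raw) for name in names}


registry = FeatureRegistry()
registry.register(FeatureDefinition(
    name="listen_ratio",
    transform=lambda raw: raw["seconds_played"] / raw["track_seconds"],
    description="Share of the track that was actually played",
))
registry.register(FeatureDefinition(
    name="audience_rank",
    transform=lambda raw: float(raw["audience_rank"]),
    description="Rank of the track by audience",
))

raw_event = {"seconds_played": 90, "track_seconds": 180, "audience_rank": 12}

# Two models request overlapping features from the same registry.
recommendation_features = registry.compute(["listen_ratio", "audience_rank"], raw_event)
ad_targeting_features = registry.compute(["audience_rank"], raw_event)
```

Because both models call `compute` on the same registry, the audience ranking is defined once and cannot drift between the recommendation and advertising use cases.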

During the training phase, the feature store is also there to guarantee the integrity of the data sets. “The training data may mistakenly include information dedicated to testing the model once it has been trained. This obviously distorts the results. This is called a feature leak,” explains Ismaïl Lachheb at Octo Technology. Such a leak can happen easily. “In the case of a database with a dozen joins (between the tables, editor’s note), all it takes is an error in one of them for the training set to access data intended for testing,” says the data scientist. When the various data sets are created, the feature store is responsible for keeping them strictly separated, regardless of the number of joins and tables used. “It manages the versioning and execution of the training process in line with the evolution of the state of the data over time,” adds Sergio Winter, machine learning engineer at Revolve, a Devoteam entity with expertise in AWS.
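One common way to respect "the state of the data over time" is a point-in-time lookup: each training example only sees feature values recorded at or before its own timestamp. The sketch below illustrates that idea with the standard library only. The data layout, function names and integer timestamps are assumptions made for illustration, not the behavior of a specific product.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Hypothetical feature history: entity id -> list of (timestamp, value),
# sorted by timestamp. Timestamps are plain integers (e.g. day numbers).
History = Dict[str, List[Tuple[int, float]]]


def value_as_of(history: History, entity_id: str, as_of: int) -> Optional[float]:
    """Return the latest value recorded at or before `as_of`, else None."""
    records = history.get(entity_id, [])
    timestamps = [ts for ts, _ in records]
    idx = bisect_right(timestamps, as_of)  # count of records with ts <= as_of
    return records[idx - 1][1] if idx > 0 else None


def build_training_rows(history: History, labels: List[Tuple[str, int, int]]):
    """Attach to each label the feature value known at the label's timestamp."""
    return [
        (entity_id, ts, value_as_of(history, entity_id, ts), label)
        for entity_id, ts, label in labels
    ]


history = {"user_1": [(1, 10.0), (5, 25.0), (9, 40.0)]}
labels = [("user_1", 4, 0), ("user_1", 6, 1), ("user_1", 0, 0)]
rows = build_training_rows(history, labels)
```

The value recorded at timestamp 9 never reaches the examples dated 4 and 6, so information from the future cannot leak into those rows. The example dated 0 gets `None` because no value existed yet.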

The last major benefit of the feature store is that it standardizes how features are formatted and computed between training and real-world prediction. “If the data preprocessing is not exactly the same in both cases, a training/inference bias will appear with a negative impact on the quality of the results,” warns Ismaïl Lachheb. This difference may be due to negligence, or to the training and inference data sources being managed by different teams. By controlling how features are computed during both training and prediction, the feature store ensures consistency between the two.
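The consistency described here comes down to routing both paths through the same preprocessing code. A minimal sketch, assuming hypothetical raw fields (`play_count`, `seconds_played`, `track_seconds`) and function names:

```python
import math
from typing import Dict, List


def featurize(raw: Dict[str, float]) -> List[float]:
    """Single preprocessing function shared by training and inference."""
    return [
        math.log1p(raw["play_count"]),                 # dampen heavy listeners
        raw["seconds_played"] / raw["track_seconds"],  # share of track played
    ]


def build_training_matrix(raw_rows: List[Dict[str, float]]) -> List[List[float]]:
    # Offline path: batch of historical rows.
    return [featurize(row) for row in raw_rows]


def featurize_request(raw: Dict[str, float]) -> List[float]:
    # Online path: one incoming request, same transformation.
    return featurize(raw)


raw_rows = [
    {"play_count": 0, "seconds_played": 30, "track_seconds": 120},
    {"play_count": 9, "seconds_played": 200, "track_seconds": 200},
]
training_matrix = build_training_matrix(raw_rows)
online_vector = featurize_request(raw_rows[1])
```

Since `build_training_matrix` and `featurize_request` both delegate to `featurize`, the same raw record yields the same vector in training and at prediction time. If two teams each maintained their own copy of this logic, nothing would enforce that.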

What are the feature store tools?

To get started, there are several feature store tools, some open source and some commercial. The most popular ones are:

  • Feast,
  • Hopsworks,
  • Tecton.

Tecton was created by the developers behind Uber’s AI platform (read the article Feature store comparison: Tecton outshines Feast and Hopsworks). On the cloud side, Amazon Web Services (AWS) and Google also market managed feature stores. AWS’s offering has the advantage of integrating with Data Wrangler. “Unlike Google’s Cloud Dataprep (which is based on a third-party application from Trifacta, editor’s note), it’s a graphical tool that handles not only batch processing but also data transformations and data set updates in real time,” compares Sergio Winter (see the article AI Cloud Platforms: Amazon and Microsoft outpaced by Google).

Databricks Feature Store

Databricks offers its own AI platform: Databricks Machine Learning. Built to run on the vendor’s infrastructure, which is designed to unify big data analytics and machine learning, it includes a feature store.

Feast: open source feature store

Feast is an open source feature management library. It lets you define feature stores that help with building models and retrieving data.

Feature store vs data warehouse

A data warehouse (in French, entrepôt de données or EDD) is typically a relational database that collects data from a wide variety of sources. Its main function is to support analysis and improve a company’s decision-making process.

The feature store is a kind of feature-oriented data warehouse at the service of machine learning. It differs architecturally in that it is a dual database, each part with its own characteristics:

  • A database that contains historical data with a large temporal depth, delivered through the SDK (Software Development Kit).
  • A database that contains fresh and streaming data. This database is faster, so that it can serve “fresh” data.
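The dual-database layout listed above can be sketched as a single class that writes to both sides: an append-only history for building training sets, and a latest-value lookup for fast serving. The class and method names are hypothetical, and a real system would back each side with a different storage engine rather than Python dictionaries.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (entity id, feature name)


class DualFeatureStore:
    """Hypothetical sketch: an append-only history plus a latest-value cache."""

    def __init__(self) -> None:
        # "Offline" side: full history, used to build training sets.
        self._offline: Dict[Key, List[Tuple[int, float]]] = defaultdict(list)
        # "Online" side: only the freshest value, used for low-latency serving.
        self._online: Dict[Key, float] = {}

    def write(self, entity_id: str, feature: str, timestamp: int, value: float) -> None:
        key = (entity_id, feature)
        self._offline[key].append((timestamp, value))
        self._online[key] = value  # assumes writes arrive in timestamp order

    def get_online(self, entity_id: str, feature: str) -> Optional[float]:
        return self._online.get((entity_id, feature))

    def get_history(self, entity_id: str, feature: str) -> List[Tuple[int, float]]:
        return list(self._offline.get((entity_id, feature), []))


store = DualFeatureStore()
for day, rank in [(1, 12.0), (2, 7.0), (3, 3.0)]:
    store.write("track_42", "audience_rank", day, rank)

latest = store.get_online("track_42", "audience_rank")
history = store.get_history("track_42", "audience_rank")
```

Here `latest` holds only the most recent ranking, which is what a model needs at prediction time, while `history` keeps every recorded value for training.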