What kind of data lake do we need in the Big Model era?

Community Article Published October 20, 2023

In the era of large models, big data and AI are undoubtedly the two most important technical ecosystems. However, the two ecosystems remain divided in many respects. The division is particularly prominent in storage, formats, processes, frameworks, and platforms, and it poses significant challenges for developers implementing end-to-end data processing and AI workflows. LakeSoul, an open-source data lakehouse project, is therefore committed to finding new ways to integrate big data and AI more effectively and to bridge the gap between them. We have adopted an integrated Data + AI approach that lets users connect data processing seamlessly with AI model applications and enables bidirectional interaction between data lakes and large-scale AI models.

1. Perfect combination

1.1 Data+AI integration design

The LakeSoul architecture combines the Java big data ecosystem with the Python AI ecosystem and supports training and inference in various AI frameworks. Thanks to its data management and computing capabilities, the LakeSoul framework also offers a comprehensive, broadly applicable set of solutions. It supports big data computing engines such as Spark, Flink, and Presto, covering common needs in stream processing, batch computing, and BI analysis. Furthermore, LakeSoul integrates seamlessly with AI and data science frameworks such as PyTorch, Pandas, HuggingFace, and Ray.

1.2 Provide a solid data foundation for AI models

With its efficient and stable data processing capabilities, the LakeSoul architecture can easily handle terabyte-scale data, whether structured or unstructured. This ability is crucial to the training and inference of large models, because improving a large model's inference quality requires a large amount of training data. In addition, LakeSoul's high-performance Native IO design keeps large model training efficient, further enhancing its competitiveness in the AI field. The early stage of large-scale model training usually requires strict data screening and cleaning. This involves a lot of ETL work and relies on big data systems such as Spark for data preprocessing, including writing new data, deduplication, and cleaning dirty data, and it requires multiple rounds of model training on different data versions. LakeSoul's all-in-one design, with its partitioning, snapshot, and incremental read and write capabilities, accelerates the pace of model iteration.
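The preprocessing loop described above (write new data, deduplicate, clean dirty records, then train on a specific data version) can be sketched in plain Python. This is only an illustration of the idea and not LakeSoul's API: the `SnapshotStore` class, its method names, and the `id`/`text` record fields are hypothetical, and a real pipeline would run these steps in an engine such as Spark against LakeSoul tables.

```python
from dataclasses import dataclass, field


def clean_and_deduplicate(records):
    """Drop dirty records (missing id or blank text) and keep one record per id.

    When the same id appears more than once, the last occurrence wins,
    mimicking an upsert by primary key.
    """
    latest = {}
    for rec in records:
        text = rec.get("text")
        if rec.get("id") is None or not text or not text.strip():
            continue  # dirty record: skip it
        latest[rec["id"]] = {"id": rec["id"], "text": text.strip()}
    return list(latest.values())


@dataclass
class SnapshotStore:
    """Hypothetical in-memory stand-in for a table that keeps snapshots."""

    versions: list = field(default_factory=list)

    def commit(self, new_records):
        """Merge new records into the latest version; return the new version number."""
        current = self.versions[-1] if self.versions else []
        merged = clean_and_deduplicate(current + new_records)
        self.versions.append(merged)
        return len(self.versions) - 1

    def read_snapshot(self, version):
        """Return the full table as of the given version."""
        return list(self.versions[version])

    def read_incremental(self, old_version, new_version):
        """Return records that were added or changed between two versions."""
        old = {r["id"]: r for r in self.versions[old_version]}
        return [r for r in self.versions[new_version] if old.get(r["id"]) != r]


store = SnapshotStore()
v0 = store.commit([
    {"id": 1, "text": "good sample"},
    {"id": 1, "text": "good sample"},  # exact duplicate
    {"id": 2, "text": "   "},          # dirty: blank text
])
v1 = store.commit([
    {"id": 3, "text": "new sample"},
    {"id": 1, "text": "corrected sample"},
])

print(store.read_snapshot(v0))         # data version used by an earlier training run
print(store.read_incremental(v0, v1))  # only what changed since that run
```

Keeping every version addressable means an earlier training run can be reproduced from its snapshot, while the incremental read lets the next round process only the changed records.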

1.3 Easily unlocking the potential of multimodal data

The LakeSoul architecture can process unstructured data, including multimodal data such as text, images, audio, and video. These rich data resources can be used to train large AI models such as BERT, CLIP, GPT, and Stable Diffusion. This capability helps unlock the potential of multimodal data.

2. Application cases

The AI modeling process based on LakeSoul is shown in the figure below. LakeSoul can process stream and batch data at the same time and supports sample preprocessing on the data lake. With Native IO, LakeSoul can feed data directly to AI models. In the following sections, we describe in detail how the LakeSoul lake storage platform, together with seamless integration with AI ecosystems such as PyTorch and HuggingFace, improves the whole process of model training, inference, and application. We've posted the full code on GitHub, which you can access at the following link: https://github.com/lakesoul-io/LakeSoul/python/examples

2.1 Start with a binary classification problem

Here we start with the classic Kaggle Titanic case, which we model with LakeSoul and PyTorch. Binary classification is widely used in industry: ranking in advertising and recommendation systems, loan overdue estimation, and similar problems can in principle all be modeled as binary classification. Our work here is divided into three stages:

1. In the data ingestion stage, the original data is imported into LakeSoul.

2. In the data processing stage, we carry out feature engineering on the LakeSoul platform, including one-hot encoding of categorical features, feature derivation, normalization, and other operations.

3. In the model training stage, we train and validate a 3-layer neural network model written in PyTorch.

Although this case uses a static dataset, LakeSoul's design philosophy and technical architecture can support real-time data updates, real-time feature updates, and online learning of models. This case demonstrates LakeSoul's ability to handle large-scale dataset computation and feature engineering and to support AI model training and validation.
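The feature-engineering operations named in the data processing stage (one-hot encoding, feature derivation, normalization) can be illustrated with a minimal plain-Python sketch. It is not the code from the LakeSoul example repository: the column names (`sex`, `age`, `sibsp`, `parch`), the derived `family_size` feature, and the toy rows are assumptions chosen for illustration.

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns the sorted category list and one 0/1 vector per input value.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


def min_max_normalize(values):
    """Scale numeric values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - low) / (high - low) for v in values]


# Toy rows with Titanic-style columns (values are made up for illustration).
passengers = [
    {"sex": "male", "age": 20.0, "sibsp": 1, "parch": 0},
    {"sex": "female", "age": 40.0, "sibsp": 0, "parch": 2},
    {"sex": "female", "age": 60.0, "sibsp": 0, "parch": 0},
]

categories, sex_encoded = one_hot([p["sex"] for p in passengers])
age_scaled = min_max_normalize([p["age"] for p in passengers])
# Derived feature: family size = siblings/spouses + parents/children + self.
family_size = [p["sibsp"] + p["parch"] + 1 for p in passengers]

# Concatenate the encoded, scaled, and derived columns into one vector per row.
features = [
    sex_vec + [age, size]
    for sex_vec, age, size in zip(sex_encoded, age_scaled, family_size)
]
for row in features:
    print(row)
```

Each resulting row is a fixed-length numeric vector, which is the form a neural network classifier expects as input.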

2.2 Fine-tuning a pre-trained NLP model

The Titanic example above walked through the process of "data entering the lake -> preprocessing -> model training". Next, we train a sentiment classification model on the IMDB dataset, showing how to fine-tune a BERT model (distilbert-base-uncased) with the HuggingFace Trainer API. Compared with the original IMDB example provided by HuggingFace (https://huggingface.co/docs/transformers/tasks/sequence_classification), we only need to read the data stored in LakeSoul through an IterableDataset and make some adjustments to the TrainingArguments; most of the rest of the code remains unchanged.
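To show why switching to an iterable dataset touches the training arguments, here is a minimal plain-Python sketch. An iterable dataset yields samples one at a time and has no length, so the training loop has to be told how many steps to run. The HuggingFace Trainer, for instance, requires `max_steps` when the training dataset does not implement `__len__`. We assume this is among the adjustments meant above; the source does not list them. `StreamingTable`, `batched`, `steps_for`, and the fake review rows are hypothetical stand-ins, not LakeSoul or HuggingFace APIs.

```python
import math


class StreamingTable:
    """Hypothetical stand-in for a table that is read as a stream.

    It defines __iter__ but deliberately no __len__, like an iterable dataset.
    """

    def __init__(self, row_source):
        self._row_source = row_source  # a callable returning a fresh iterator

    def __iter__(self):
        return iter(self._row_source())


def batched(dataset, batch_size):
    """Group streamed samples into lists of at most batch_size items."""
    batch = []
    for sample in dataset:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def steps_for(num_rows, batch_size, epochs):
    """Number of training steps needed to cover num_rows for the given epochs."""
    return math.ceil(num_rows / batch_size) * epochs


def fake_reviews():
    # Made-up rows shaped like IMDB samples: review text plus a 0/1 label.
    for i in range(10):
        yield {"text": f"review {i}", "label": i % 2}


train_data = StreamingTable(fake_reviews)
batches = list(batched(train_data, batch_size=4))
max_steps = steps_for(num_rows=10, batch_size=4, epochs=3)

print([len(b) for b in batches])
print(max_steps)
```

Because the row count cannot be read from the stream itself, it has to come from somewhere else, such as table metadata or a one-off count, before the step budget is computed.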

2.3 CLIP-based image and text search

The IMDB example above showed how to train a model using LakeSoul and the HuggingFace Trainer API. In this case, we use the Food 101 dataset to show how to run inference on samples with a CLIP model and implement image and text search. The process consists of two stages:

1. Model inference: we load a CLIP model from HuggingFace (clip-ViT-B-32-multilingual-v1) and use it to run inference on the images in the image dataset, generating an embedding for each image.

2. Semantic search: the user enters a text description, and the system returns the best-matching image by calculating the vector distance between the text and each image.
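The semantic search stage can be sketched as a nearest-neighbor lookup over precomputed embeddings. The sketch below assumes cosine similarity as the vector measure (the text above only says "vector distance") and uses made-up 3-dimensional vectors in place of real CLIP embeddings. The `search` function and the file names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def search(query_embedding, image_embeddings, top_k=1):
    """Return the top_k (image_id, score) pairs, best match first."""
    scored = [
        (image_id, cosine_similarity(query_embedding, embedding))
        for image_id, embedding in image_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up 3-dimensional embeddings; a real CLIP model produces much longer vectors.
image_embeddings = {
    "pizza.jpg": [0.9, 0.1, 0.0],
    "sushi.jpg": [0.1, 0.9, 0.1],
    "salad.jpg": [0.0, 0.2, 0.9],
}
# Stand-in for the embedding of a text query such as "a slice of pizza".
query_embedding = [0.8, 0.2, 0.1]

results = search(query_embedding, image_embeddings, top_k=2)
print([image_id for image_id, _ in results])
```

A linear scan like this is enough for a small image set; larger collections would typically use a vector index instead.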

Conclusion

In the previous article, we explored LakeSoul's Data+AI design concept in depth, and in this article we presented a few concrete practical cases. In the future, we plan to publish more articles detailing how LakeSoul integrates with leading open-source AI frameworks such as PyTorch, HuggingFace, DeepSpeed, and Ray. Just as Steve Jobs redefined the mobile phone and the iPhone opened up a whole new world of mobile Internet for users, LakeSoul is redefining the data lake in the era of large models by integrating the Java big data ecosystem with the Python AI ecosystem. With LakeSoul, developers can quickly build applications that span data processing through AI models and easily unlock the value of multimodal data. We firmly believe that this will open a new chapter for data lake technology in the era of large models!