2022 in Big Data and Machine Learning

A review from a field data and machine learning architect

Yifeng Jiang
6 min read · Dec 30, 2022

It is a beautiful, sunny Friday morning. I am sitting at a table near the window in a nice cafe, listening to peaceful music and looking at the trees outside. I am starting my last workday of 2022. There are only some system-generated reports in my email inbox. No notifications from Slack. The project in my VS Code can wait. This is a perfect environment for a retrospective review of the year.

Photo by Bonnie Kittle on Unsplash

As for most people I know, the biggest change in 2022 for me was travel and meeting people. I restarted travelling, both personally and professionally, in April 2022. Since then, I have travelled to the UK, the US, Japan and several other Asian countries. I have met a lot of people in big data and machine learning again: my customers, prospects, partners and colleagues. I learned lessons, revisited and verified my thoughts, and discovered new insights from the field. In this, my last blog post of 2022, I want to share my review as a field data and machine learning architect, from both market and technology perspectives.

There is no magic — AI and ML are maturing

I remember two or three years ago, when I was watching the Hakone Ekiden live on TV, the announcer said something like “based on the AI prediction, here is the likely result for this year”. Maybe there were some statistics or software behind the scenes, but certainly not AI. Back then, AI was a buzzword. AI was everywhere: TVs, homepages and RFPs (Requests for Proposal). We simply replaced “software” or “feature” with “AI”.

This was different in 2022. The AI hype is decreasing, which I think is good. We saw less AI on TV, in ads and in RFPs in 2022. Organisations realised that they cannot just download or buy an AI to do the magic. Many learned this the hard way. Many organisations bought very expensive hardware and software for their AI projects, only to realise that they didn’t need an AI, or that they didn’t have the skillsets to run a successful AI & ML project. This expensive equipment ended up sitting in the datacenter, unused. So organisations stopped buying. People became more cautious about AI. This certainly hurts AI vendors. However, I think this is actually good. Realistic expectations are a positive change. It is healthy for the industry in the long run, because it is a signal that the industry and market are maturing. Resources are being allocated properly. Organisations that succeeded in AI projects are expanding. For example, in APJ, where I spent most of my time, I saw quite a few of these expansions in Korea and Singapore.

We may not yet have seen all the promises of AI fulfilled, but it is making progress in a healthy way. In 2021, companies used AI to speed up COVID vaccine trials. In 2022, scientists are using AI to dream up revolutionary new proteins. ChatGPT can answer almost any question in a conversational way, although its answers are not always correct.

AI & ML are real. Therefore, as a customer, the questions we need to ask are: What can an AI system really do for me? Do I need to build and own an AI system? What are the foundations for a successful AI project? Who are the experts to ask? For vendors, it becomes even more important to qualify an AI opportunity, and to be a trusted advisor rather than a seller to your customers.

Back to basics: fixing the data system first

Maybe AI is not for every organisation in 2022, but data is definitely for everyone, and easy-to-access big data is especially critical. The need to improve and scale big data systems remains strong among enterprises and government agencies across countries in APJ. FSI, telecom and government are the top three sectors with strong demand for big data analytics. There are several reasons for this, including:

  1. COVID has accelerated digital transformation and made it a mandate for many businesses, which makes data analytics more important than ever.
  2. Security has never been more important than in 2022, in both the physical and cyber worlds. Information (data) is the key to ensuring security.
  3. Many AI challenges are data challenges. This is especially true for large organisations. To succeed in AI (and therefore gain competitive advantages), one must have a solid big data system.

Many organisations’ big data systems were built years ago based on Hadoop. These systems are very difficult and expensive to build, use and operate. Often, people spend more time managing the system than analysing the data. We must go back to basics and fix the data system issue first. Many organisations finally realised that it is too complex to do big data with traditional technology like Hadoop, and started to embrace new approaches such as:

  • Adopt an as-a-service or like-a-service architecture, such as separating compute and storage and using object storage for the data lake. This is true not only for systems in the cloud, but also for on-premises environments.
  • All or hybrid open source solutions. COVID, supply chain and other factors have challenged enterprises and governments to do more with less. Software license cost is a big chunk of big data system spending. Because of this, some skilled organisations have chosen to go with all open source software. Many others, who were still building their engineering teams, chose a hybrid approach: an enterprise data infrastructure solution (e.g., commercial Kubernetes and object storage), plus open source software (Apache Spark, Trino, Jupyter, etc.) for compute.
  • Use a specialised system for some particular use case. Take log analytics as an example. Rather than building a general streaming system for this, many chose to use specialised solutions like Splunk and Elastic.

The goal is to make big data easier to access and operate, and less expensive.

The future is bright — exciting technologies and promising outcomes

Organisations are improving big data systems in different ways, but there are common themes: the technologies are exciting and the outcomes are promising. The future is bright in big data and ML.

Here are some topics and technologies that showed up in most of the conversations I had with customers and partners in 2022.

Simple system for ordinary tasks. While exciting technologies and promising outcomes easily draw attention, we must not forget that many organisations just want to get their big data and ML tasks done in a simple way. Topics like running Apache Spark on Kubernetes with an S3 data lake, scaling a Jupyter notebook beyond a single machine, or querying CSV files in Trino remain the most popular in my customer conversations and blogs.
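To make the first of these topics a little more concrete, here is a minimal sketch of how a Spark job might be pointed at a Kubernetes cluster and an S3-compatible data lake. It only assembles a spark-submit command from standard Spark configuration keys and does not run anything. The helper name build_spark_submit, the URLs, the image name and the namespace are all hypothetical placeholders, and S3 credentials are deliberately left out on the assumption that they come from the environment or Kubernetes secrets.

```python
from typing import Dict, List


def build_spark_submit(
    k8s_api_url: str,
    image: str,
    s3_endpoint: str,
    app_path: str,
    executors: int = 2,
    namespace: str = "spark",
) -> List[str]:
    """Assemble a spark-submit argv for a Spark-on-Kubernetes job that reads from S3.

    Hypothetical helper for illustration; every argument value is a placeholder.
    """
    conf: Dict[str, str] = {
        # Where and how the driver and executor pods run.
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
        # Point the S3A connector at the object store backing the data lake.
        "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
        # Path-style access is commonly needed for on-premises S3-compatible stores.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }
    argv = [
        "spark-submit",
        "--master", f"k8s://{k8s_api_url}",
        "--deploy-mode", "cluster",
    ]
    for key, value in conf.items():
        argv += ["--conf", f"{key}={value}"]
    argv.append(app_path)
    return argv


if __name__ == "__main__":
    cmd = build_spark_submit(
        k8s_api_url="https://k8s.example.internal:6443",  # placeholder API server
        image="example/spark:latest",                     # placeholder image
        s3_endpoint="https://s3.example.internal",        # placeholder object store
        app_path="local:///opt/spark/app/job.py",         # script baked into the image
        executors=4,
    )
    print(" ".join(cmd))
```

Compute and storage stay separate in this setup: the executors are short-lived pods, while the data lives in the object store and is addressed through s3a:// paths inside the job.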

Data Lakehouse Architecture. While not new, the data lakehouse definitely gained more attention in 2022 than before. Some organisations I talked to are either in the process of deploying a data lakehouse, or seriously considering one. Some use open source software (e.g., Delta Lake and Trino); others choose commercial solutions (e.g., Dremio). My data lakehouse blog is the most popular among all my 2022 blogs.

Consolidation and acceleration. A few years ago, I had to spend lots of time to explain why separating compute and storage is the direction for big data and ML. In 2022, more and more customers came to ask me how to do this and finally consolidate and accelerate their system for both storage and compute, for reasons like better ESG (Environmental, Social, and Governance), operation, and efficiency. I have summarised my thoughts in my Smaller is Better — Big Data System in 2023 blog, but in a nutshell, unified fast file and object storage like FlashBlade enables storage consolidation and acceleration. Kubernetes, GPU and Spark RAPIDS make compute consolidation and acceleration promising.

Feeling excited for 2023

I can only share a few of my observations and thoughts about big data and machine learning in 2022 in this blog. So many innovations and changes have happened. Can you believe 2023 is only a day away? I hope you are feeling as excited about big data and machine learning in 2023 as I am.

See you in 2023.

Yifeng Jiang

Software & solutions engineer, big data and machine learning, jogger, hiker, traveler, gamer.