Data and AI Skills, Better Together

View from a “Data Scientist” at a Storage Company

Yifeng Jiang
4 min read · Dec 5, 2023

When I started building my first big data cluster with Apache Hadoop in 2009, I had no idea I would stay in this career for 15 years and extend it into AI. Over the years, I learnt that data and AI are better together. I learnt that organisations with better data infrastructure tend to gain more value from their data and AI. I realised that having skills in both big data and AI, whether as an individual or as a team, is very beneficial for the coming AI age.

Photo by Joshua Sortino on Unsplash

Data and AI

AI, especially LLMs (Large Language Models), is the hot topic these days. Everyone I work with talks about ChatGPT and LLMs. No doubt LLMs are amazing, but I think it is equally important to talk about data, because without large amounts of high-quality data, smart AI like LLMs would not be possible. An AI system is really both the data and the model.

As someone working in the data and AI industry, I was happy to hear that in Singapore's updated National AI Strategy (NAIS) 2.0, launched yesterday (Dec. 4), the government recognises the importance of both AI and data.

To make Singapore a more conducive place for AI value creation, the government will increase the availability of high-performance computing power and access to data.

Data storage — a foundation technology

Infrastructure for large AI systems needs to support both big data and AI workloads. It is much easier to use or build an AI system for your business if you have already ingested your operational data from multiple sources into a single, easy-to-access repository.

Data and AI infrastructure

One critical building block of big data and AI infrastructure is data storage. Many data scientists may not be aware that the NFS folder they use for training datasets and model checkpoints is usually backed by a high-performance all-flash storage array, which can send data directly to GPU memory to speed up their training jobs.
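
To make that a bit more concrete, here is a minimal sketch of what this looks like from the data scientist's side. The mount path and file layout are made up; the point is that the training code simply reads files from an NFS folder, and the storage array underneath determines how fast those reads are.

```python
import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader

# Hypothetical NFS mount backed by an all-flash storage array (path is made up).
DATA_DIR = "/mnt/nfs/datasets/images"

class NfsImageDataset(Dataset):
    """Reads training samples straight from an NFS-mounted folder."""

    def __init__(self, root=DATA_DIR):
        self.files = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".jpg")
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Every call is a read against the shared storage; fast flash keeps the GPUs fed.
        return Image.open(self.files[idx]).convert("RGB")

# Multiple workers issue parallel reads against the same shared folder.
loader = DataLoader(NfsImageDataset(), batch_size=64, num_workers=8)
```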

Data storage is one of the foundation technologies for big data and AI, and data engineers and data scientists who understand these foundation technologies are more likely to build better data pipelines and models.

A “data scientist” at a storage company

"As a data and machine learning engineer, or 'data scientist' as many call it, there are so many opportunities for your skills. Why do you work for a storage company?" I sometimes get this question from friends, clients and even interview candidates. I am passionate about big data and AI. Data is the common piece, and storage is its foundation. I enjoy working on foundation technologies. I guess that is why I work for a data storage company.

Although I am often introduced as a data scientist in meetings, my official title is "Field Solutions Architect, Data Science". It is not the same thing, but I gave up explaining the difference a long time ago. I do build data pipelines and models from time to time, but my main priority is to support our local field teams in engaging customers and helping them build big data and AI systems on our storage products. I like travelling to meet customers, presenting at conferences, writing code and building the lab, and I think it is this mix that makes the role unique. It is not perfect, of course. One challenge I often have is how to effectively help storage people (mostly colleagues and partners) understand AI, help data scientists (mostly customers) understand storage, and bridge the two together.

Data and AI technical skills

In a recent PoC (Proof of Concept) AI project, I helped a customer tune their PyTorch code to get the training speed they expected when moving from a single-GPU setup to a multi-GPU, shared-storage architecture. It took just a few lines of code changes to avoid unnecessary gradient broadcasting and data serialisation, but it increased training speed several times over. Familiarity with PyTorch's Distributed Data Parallel (DDP) was one reason; the other reason I was able to quickly identify the bottleneck was my experience with big data and distributed systems, and being able to apply it to AI. Data and AI skills together are critical to my job.
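
The changes themselves are not exotic. Below is a minimal sketch, not the customer's actual code; the model, dataset, batch size and accumulation steps are placeholders. It shows the general pattern: wrap the model in DistributedDataParallel so each GPU gets its own process, shard the data with a DistributedSampler instead of sending the full dataset to every rank, and use no_sync() to skip the gradient all-reduce on accumulation steps.

```python
import os
import contextlib
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, epochs=1, accum_steps=4):
    # One process per GPU, launched with torchrun, which sets LOCAL_RANK etc.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Each rank reads only its own shard of the dataset from shared storage.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                        num_workers=4, pin_memory=True)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)
        for step, (x, y) in enumerate(loader):
            x = x.cuda(local_rank, non_blocking=True)
            y = y.cuda(local_rank, non_blocking=True)
            sync = (step + 1) % accum_steps == 0
            # no_sync() skips the gradient all-reduce on accumulation steps,
            # avoiding unnecessary cross-GPU communication.
            ctx = contextlib.nullcontext() if sync else model.no_sync()
            with ctx:
                loss = torch.nn.functional.cross_entropy(model(x), y)
                (loss / accum_steps).backward()
            if sync:
                optimizer.step()
                optimizer.zero_grad()

    dist.destroy_process_group()
```

Launched with something like `torchrun --nproc_per_node=8 train.py`, each process drives one GPU and reads its own data shard from the shared storage.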

Another foundation technology in this area is Kubernetes. AI runs in containers, and Kubernetes is the most popular container orchestrator. Much big data software, including Apache Spark, is also embracing Kubernetes as the standard resource scheduler for running across many servers.
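
As a small illustration of what this looks like in practice (the API server address, container image, namespace and bucket below are placeholders, not a real setup), submitting a PySpark job to Kubernetes is mostly a matter of pointing the session at the Kubernetes API and telling it which container image to run the executors in:

```python
from pyspark.sql import SparkSession

# Placeholder Kubernetes API server, image and namespace; substitute your own.
spark = (
    SparkSession.builder
    .master("k8s://https://kube-apiserver.example.com:6443")
    .appName("etl-on-k8s")
    .config("spark.executor.instances", "4")
    .config("spark.kubernetes.container.image", "registry.example.com/spark-py:3.5.0")
    .config("spark.kubernetes.namespace", "data-eng")
    .getOrCreate()
)

# Executors run as pods; the data itself lives on shared or object storage.
df = spark.read.parquet("s3a://datalake/events/")
df.groupBy("event_type").count().show()
```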

All of these technologies have breadth and depth. In data storage, for example, there are block, file and object storage, and each has its role in a data and AI pipeline. Moving up the stack, container storage for Kubernetes (CSI) kicks in. Sometimes it is interesting to move down the stack into the storage implementation itself, where we also find the depth: SSDs vs. DirectFlash, B-trees vs. LSM-trees, and so on.
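
To give a feel for the breadth at the top of that stack, here is a tiny sketch (the paths and bucket names are made up) of how the same table might be reached through a file interface versus an object interface; which one you choose shapes the rest of the pipeline:

```python
import pandas as pd

# File storage: a path on a shared (e.g. NFS) filesystem mounted on the host.
df_file = pd.read_parquet("/mnt/nfs/warehouse/sales.parquet")

# Object storage: an S3-style URL (requires s3fs; bucket and key are made up).
df_object = pd.read_parquet("s3://analytics-bucket/warehouse/sales.parquet")
```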

Back to the basics

Generative AI and LLMs are growing fast. For fast-evolving technologies like these, whether we just want to use them or build our own LLMs, it is important to understand the technology at least at a high level. For anyone working in this industry, understanding the data and AI foundations is essential. I have not met a single person who masters all of these skills, but I think it is crucial to understand the broader foundation technologies fairly well and to be really good at one or two, because this puts us in a unique and strong position in the market.

Enjoy and keep learning.

