AI for Dog Lovers: An End to End AI Pipeline with FlashBlade — Part 1
Part 1: Real-time Data Ingestion and Processing
The AI for Dog Lovers demo shows an end-to-end data analytics and AI pipeline on Pure Storage FlashBlade. We ingest real-time tweets about dogs from the Twitter API, index the tweets, and store the index files on a FlashBlade NFS volume. A real-time dashboard is built on top of the index for visualization. We also store the raw tweets as JSON files in a FlashBlade S3 bucket and perform big data analytics on the raw data. Finally, we demonstrate how to train, deploy and test an AI model that detects dogs in images and videos in near real-time.
Architecture Overview
Below is the architecture of the demo. In this architecture, we consolidate the end-to-end modern data analytics and AI pipeline on a single Pure Storage FlashBlade platform.
FlashBlade is the data hub where all user data is stored, whether it serves real-time, batch or AI workloads, on a single platform. This is possible because of FlashBlade's fast all-flash I/O and its native S3 and NFS capabilities.
The architecture highlights the data pipeline stages required in AI, as well as the flexibility of FlashBlade. These stages have very different data access patterns: sequential and random, throughput-intensive and latency-sensitive, and massively parallel. With this architecture, we can scale compute and storage independently based on demand, and scaling is easy: if more compute is required, we add VMs or containers; if we need more storage capacity, we add blades to FlashBlade. All of this can be done online. The result is cloud-like agility and flexibility.
Why FlashBlade?
Pure Storage FlashBlade is the heart of the architecture. But why FlashBlade? You may wonder whether we could do this the traditional way. Probably, but at a price. With the traditional way, the architecture may look like this:
The software you use won't change much, but if you look at where the data sits, you see the difference. With the old architecture, data lives on HDDs, SSDs, HDFS and NFS, and we have to copy it from one place to another, creating new data silos. We don't want silos; we want to put all our data in a single data hub and use it to make our AI life easier. The old architecture also requires managing many different types of hardware resources, and that complexity grows dramatically at scale. In a traditional Hadoop architecture, it is very easy to end up with unbalanced utilization of compute and storage, because the two are difficult to scale independently.
In the following sections, I will explain in detail, with code snippets and screenshots, how easy it is to build an AI pipeline on FlashBlade.
Environment Setup
The architecture leverages open source software. We use Apache NiFi to ingest data and manage the data flow, Apache Solr for real-time indexing and dashboarding, and Apache Spark as the go-to tool for big data processing. All of these are deployed and managed by Apache Ambari, as shown below.
TensorFlow is used to train the AI model. Finally, we deploy the AI model as a REST API served by a Python server.
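Once the model is serving, it can be exercised like any other web service. As a minimal sketch, assuming a hypothetical /predict endpoint on port 5000 that accepts an image upload (the endpoint name, port and form field are placeholders, not the actual deployment covered later in this series):
# Hypothetical endpoint; adjust host, port and field name to your deployment.
curl -X POST -F "image=@dog.jpg" http://localhost:5000/predict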
In my demo lab, all the software is installed and running on virtual machines backed by Pure Storage FlashArray; it could equally run on a cluster of bare metal servers. All the data, including the raw tweets, index files, dog images and the AI model, lives on FlashBlade, either in an S3 bucket or on an NFS volume.
Real-time Data Ingestion and Processing
AI and deep learning are data-hungry. Without big data, our AI won't be very smart or useful.
We start by ingesting tweets in real-time using NiFi. The data flow managed by NiFi looks like this:
Each box in the UI is called a Processor. Each processor does a simple job in the flow. NiFi has hundreds of built-in processors. It is easy to use and powerful — just drag and drop processors, configure and connect them to form a complex data flow.
For the top three processors in the flow above, we configure NiFi to call the Twitter API to collect tweets about dogs, extract the fields we need for the real-time dashboard from the Twitter response, and send those fields to Solr for real-time indexing. An example GetTwitter processor configuration looks like this:
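As a sketch, the key properties are along these lines (property names per NiFi's built-in GetTwitter processor; the credentials and filter terms below are placeholders):
Twitter Endpoint: Filter Endpoint
Consumer Key: <your consumer key>
Consumer Secret: <your consumer secret>
Access Token: <your access token>
Access Token Secret: <your access token secret>
Terms to Filter On: dog,dogs,puppy
Languages: en
The Filter Endpoint streams only tweets matching the filter terms, which keeps the ingested volume focused on dog-related content.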
A few seconds after starting the NiFi flow, dog lovers' tweets are ingested, indexed and stored in the system in real-time. Since the data is indexed in Solr, we can easily create a real-time dashboard like the one below with a couple of clicks in Solr's Banana UI.
This example dashboard shows a histogram of the tweets, who tweets the most, what languages they use, and what they tweet about dogs.
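You can also query the Solr index directly to spot-check what has been ingested. A minimal example with curl, assuming the collection is named tweets and Solr listens on its default port 8983 (adjust both for your setup):
# Return five documents as JSON (collection name is assumed).
curl "http://localhost:8983/solr/tweets/select?q=*:*&rows=5&wt=json"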
Both NiFi and Solr store their data on FlashBlade NFS volumes, so the first step is to mount the NFS file systems on the NiFi and Solr hosts, as shown below.
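As a sketch, assuming the FlashBlade data VIP is 192.168.50.100 and file systems named nifi and solr have already been created on the array:
# Mount the FlashBlade NFS file systems (VIP and export names are examples).
sudo mkdir -p /mnt/nifi /mnt/solr
sudo mount -t nfs 192.168.50.100:/nifi /mnt/nifi
sudo mount -t nfs 192.168.50.100:/solr /mnt/solr
With the mounts in place, point each service's data directories at them by changing the following configurations.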
Changes in nifi.properties
nifi.content.repository.directory.default=/mnt/nifi/content_repository
nifi.database.directory=/mnt/nifi/database_repository
nifi.flowfile.repository.directory=/mnt/nifi/flowfile_repository
nifi.provenance.repository.directory.default=/mnt/nifi/provenance_repository
Changes in solr.in.sh
# HDFS start settings
SOLR_OPTS="$SOLR_OPTS -Dsolr.directoryFactory=HdfsDirectoryFactory \
-Dsolr.lock.type=hdfs -Dsolr.hdfs.home=file:///mnt/solr/indexes \
-Dsolr.hdfs.confdir=/usr/hdp/2.6.5.0-292/hadoop/conf"
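Note that this setup reuses Solr's HdfsDirectoryFactory but points solr.hdfs.home at a file:// path, which is the local FlashBlade NFS mount, so no actual HDFS cluster is involved. After restarting Solr, the index files should land on the mount; a quick sanity check (path as configured above):
ls -lh /mnt/solr/indexes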
If we go to the FlashBlade dashboard, we can see the traffic generated by this data pipeline.
What’s Next?
A large part of building AI is exploring your data, cleansing it, and making it ready for training and validating AI models.
In Part 2 of this series, I’ll dig into the big data analytics part of the pipeline with FlashBlade S3, Apache Spark and Zeppelin.
Please note: I am an employee of Pure Storage. My statements and opinions on this site are my own and do not necessarily represent those of Pure Storage.