
How to use Machine Learning on Azure Government with HDInsight


>>Hi, this is Steve Michelotti of the Azure Government
Engineering Team. I’m joined here today
by my colleague, Yujin Hong, also on the Azure Government
Engineering Team. Welcome, Yujin.>>Hi Steve.>>So, we're here
today to talk about machine learning on
Azure Government. I feel like this
is a question that comes up a lot for us. So, why don’t you start out
by giving us some options. What are the options
we have available for ML on Azure Gov today?>>That’s a great
question. So, one of our VM offerings is actually the Data
Science Virtual Machine, which pretty much comes
pre-loaded with all of the frameworks and tools that one might need for data science. Things like Spark, TensorFlow, SQL Server, as
well as Power BI. Another VM offering that we
actually have is ML Server, on which you can go through the whole machine learning
and deep learning pipeline, all on that one VM.>>Okay. So, when
you say VM offering, these are things
that we just install straight on a virtual machine
and you can spin them up.>>You just click “Deploy” and they’re ready to use right away.>>Okay. All right, cool.>>We actually have
Azure Databricks coming to Azure Gov by
the end of this year.>>Okay.>>So, that’s something
to look forward to as well as HDInsight, which I’m actually going to be showcasing
in my demo today.>>Okay, great. So, HDInsight. So what is HDInsight? Can you give us a little bit
more information about that?>>Right. So,
HDInsight is actually a fully managed Azure service that pretty much deploys and provisions Apache Hadoop clusters into the Azure cloud. So, what this really
means is users can customize and control
their clusters. So, they have a bunch of open-source offerings
such as Hive, Pig, Hadoop as well as
Spark that they can choose when they’re deploying their cluster as
well as the scaling. So, if you have a large data set, you might want to have more nodes or cores
on your cluster. So, you can easily configure and customize that
when you're deploying.>>So, you're saying I can say, give me a 100-node cluster, or maybe I only need 10 nodes.>>Exactly.>>Or give me 100, but then when I'm done, I don't need it anymore.>>Right. So, that's a great part of it, is that you can save a lot of money by just deleting your cluster after you're done and spinning up a new one when you want to use it again.>>Awesome, okay. So, how do I get started with machine
learning using HDInsight? I heard you mention some things like Spark. Tell me
more about that.>>Right. So, I'm actually going to focus on Spark today because that is what I'm going to be using in my demo. So, the great thing about Spark is it operates on the power of Resilient Distributed Datasets, or RDDs. So, what this means is, if you have a large data set, you can distribute it evenly across worker nodes. So, these are the RDD data structures. So you can perform transformations and queries in parallel. So, that allows for much faster processing, especially if you're working with large data sets. It also supports in-memory caching. So, as you're going through the transformations and queries, as you're going through a machine learning pipeline, that's saved in memory.
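
For readers following along, here is a minimal PySpark sketch of the RDD pattern being described; the data and partition count are illustrative, not from the demo.

```python
# Minimal sketch of RDDs: data split across worker nodes, parallel transformations,
# and in-memory caching. The HDInsight PySpark Jupyter kernel provides `sc` (SparkContext).
numbers = sc.parallelize(range(1000000), numSlices=8)  # distribute the data set across workers

squared = numbers.map(lambda x: x * x)        # transformations run in parallel, per partition
evens = squared.filter(lambda x: x % 2 == 0)

evens.cache()            # keep the intermediate result in memory for the next steps
print(evens.count())
```
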
>>Okay. So, Spark is some kind of Microsoft creation?>>Actually no. Spark
is owned by Apache. It was actually created by
students at UC Berkeley. So, it’s an open source framework and HDInsight just
happens to support it.>>Okay. So, for anyone
that’s ever had to provision a Spark cluster before
manually, it’s a pain.>>Yes.>>With HDInsight,
click a few buttons, boom, there's your cluster.>>Exactly.>>Okay. Makes sense. So, what does the ML process look like?>>So, I have kind of a little
diagram over here that I’m going to be
referring to in my demo as well. So, this is a basic pipeline. Some data scientists
might not use this. This really depends on the type
of data that you have, but these are usually the
steps that you go through. So, first is data acquisition. Data is the most important part, you can load that
into Azure storage, then you want to
pre-process the data. So, that can mean
filtering for null values and only narrowing down your data set to categories
that you’re interested in. Then, data exploration
is an ongoing cycle, as you can see from the diagram, because as you’re going through and filtering and
cleaning your data, you’re going to learn
more about it and you’re going to see the patterns between the variables
and that will help you clean your data
more effectively.>>Makes sense. So, it's sort of an iterative
process so to speak?>>Yes. Most data scientists, to my knowledge, go through that process way more than once.>>Yeah.>>Then once you have
a clean data set, you can train a model
on that data set. So, actually, as I'm going to show you, Spark comes with a bunch of algorithms that are
already pre-built, but you can write one
from scratch as well.>>Okay.>>You can train that model
on your training data set. Then you want to test your model. So, you usually have your training data set
and a test data set. So once you train your model, you can test it and
get an accuracy. Then, once you're happy with it and you have a high accuracy, you want to save your model and you can
deploy your model. So, deploying your
model could mean wrapping it up into
a container and accessing it through a web app or an API call as well as
using the Livy service, which is a REST API-based service. Basically, you're making API calls directly to a Spark cluster.>>So Livy is some native API?>>Yes.>>Part of Spark itself?>>Yes.>>Okay.>>Yes.
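
For reference, a hedged sketch of what a Livy call against an HDInsight cluster can look like; the cluster name, credentials, and script path are placeholders, not values from the demo.

```python
# Submit a PySpark job to the cluster through the Livy REST endpoint (illustrative values only).
import requests

livy_url = "https://mycluster.azurehdinsight.net/livy/batches"
payload = {"file": "wasb:///scripts/score_flights.py"}  # job script in the cluster's default storage

response = requests.post(
    livy_url,
    json=payload,
    auth=("admin", "cluster-login-password"),
    headers={"X-Requested-By": "admin", "Content-Type": "application/json"},
)
print(response.json())  # returns a batch id and state that you can poll at /livy/batches/{id}
```
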
>>So, I also heard you talk about Docker, which is interesting. So, I could containerize my model and deploy it anywhere I wanted, really. It didn't even have to be in Azure; I could train using Azure and some massive 1,000-node cluster and deploy it anywhere.>>Exactly.>>Okay. All right, great. All right, so, this
is great information. I really like this diagram,
keeps it straight. So, tell us about the demo.>>So, just a quick overview of what we’re going
to be looking at, we’re actually looking at
two different data sets. So, we have a flight data set and a corresponding
weather data set. The flight data set is actually from the Bureau
of Transportation Statistics. They're recording
variables such as airport, carrier, total number of flights and the delays for each
of those variables. We have corresponding
weather data which would be like humidity and temperature. So, the purpose of this demo will be to
predict how likely it is for a flight to be delayed based on all these variables
that I just mentioned. So, the process is loading the data from Azure storage account
or wherever you have your data stored into your Jupyter notebook on
your HDInsight cluster.>>Okay.>>Then, here you do
all the data processing and you actually go through the machine learning pipeline, you spit out a model and you can save that model right
back to Azure storage.>>Okay. So, I hear
you describing that diagram on the last slide.>>Exactly.>>Cool.>>All right. So, here we
have our Jupyter notebook. This is actually running on a Spark HDInsight
cluster right now. So, we’re predicting
flight delay. I'm actually going to bring in the same diagram that we had before. So, as we go through
this notebook, we are going to hit
each of these steps. So, first step, data acquisition as you can see from
the highlighted orange. So, the great thing about HDInsight is that it has seamless integration
with Azure storage. So, I actually have my data loaded into
my Azure Blob storage account.>>So, that's the WASB, that weird-looking thing right there?>>Exactly.>>Okay.>>That's a direct
address to my data.>>Okay.>>So, this sc.textFile, this line of code right here is actually just bringing
in that data into my Jupyter notebook as an RDD which is a data structure
that Spark works with.>>Okay.>>So, now we can see
here’s the actual raw data. We know that we’ve
loaded it in correctly. However, if you look at it, it's kind of hard to read. It's very unstructured; most raw data is not going to be super pretty when you
first load it in.>>Got you, and you have on line nine there, just take(3), meaning show me the first three records.>>Exactly.>>We know it's a lot bigger than that.>>Right.>>But just show me the beginning.>>Right. There are other helpful methods like count, just to make sure you have the right number of rows.
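
As a point of reference, the data-acquisition step might look roughly like this; the storage account, container, and file names are placeholders.

```python
# Load a CSV from the cluster's Azure Blob storage into an RDD and sanity-check it.
flights_raw = sc.textFile(
    "wasb://data@mystorageaccount.blob.core.windows.net/flights/flight_delays.csv"
)

print(flights_raw.take(3))   # peek at the first three raw records
print(flights_raw.count())   # confirm the number of rows
```
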
So, that actually leads us to our next step, because we want to transform this raw data into something we can work with. So, the next step is
data pre-processing. So, it really depends
on what kind of data set and what models you
want to train your data on, but the steps that I decided to take were, first, cleaning the data and then building a DataFrame. So, a DataFrame is like a SQL table. You have the categories. It's a structured data set. It's much easier to work with. So, here are some
of the categories that we really want to look at. So, as I mentioned before, we are working with
two different data sets. I’m going to go through all of the different steps on each
of the data sets separately, and then at the end,
merge them together. So, here, we’re starting
with our airline data set, we want to look at the categories
such as year, the date, the time, the originating airport of the flight and then
the destination airport, total number of flights, things like carrier
and the actual delay.>>Okay.>>It’s a whole bunch
of data we have here.>>Yes. So, here is actually a schema that you can print
out of your raw data. As we can see, there’s
a lot of different columns, the types are different, there’s integers, there’s
doubles, there's strings. So, what I first did was filter the data for null values and convert the types into the DataFrame that I wanted.>>So, we're talking about the pre-processing. It's the janitorial aspect. We're cleaning it up.>>Yes.>>So, in this case, your example is removing the null values or converting the data types; it could be anything, but these are just examples from yours?>>Exactly.>>Okay, makes sense.>>As you can see
from the screen, it’s actually just Python code. So, what I did was, I defined a custom Python
function and I can still use the power of RDDs
and Spark, by mapping. So, this line of code right here is where I'm actually mapping the custom function that I just defined onto my RDD, the dataset that I loaded.>>Okay, so there are cleanup functions for removing null values; that's happening on line 115, right?>>Right.>>Okay, cool.>>It's just showing that you can just write free-form Python code.>>Great.>>Right, with your own functions.
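
As an illustration of that idea, a custom cleaning function mapped over the RDD might look roughly like this; the column positions and rules are assumptions, not the demo's actual code.

```python
# Custom Python function for cleaning, applied in parallel with map/filter.
def parse_flight(line):
    fields = line.split(",")
    # Drop records with missing values in the columns we care about.
    if "" in (fields[0], fields[3], fields[4], fields[8]):
        return None
    # (date, origin airport, destination airport, departure delay in minutes)
    return (fields[0], fields[3], fields[4], int(fields[8]))

flights_clean = (
    flights_raw
    .map(parse_flight)                     # runs across the worker nodes
    .filter(lambda row: row is not None)   # throw away the rows we flagged as null
)
```
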
So, this step is where I'm
construct the DataFrame. So, this block of code
is where I’m defining, here are the different
categories that I want, here are the different types
I want them to be, and then with this createDataFrame line of code right here, I am just mapping this structure that I defined onto that raw dataset.>>Okay, so, more cleanup?>>Yes.>>All right, cool.
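
A hedged sketch of that DataFrame construction, with placeholder column names and types:

```python
# Build a DataFrame from the cleaned RDD with an explicit schema.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

flight_schema = StructType([
    StructField("FlightDate", StringType(), True),
    StructField("Origin", StringType(), True),
    StructField("Dest", StringType(), True),
    StructField("DepDelay", IntegerType(), True),
])

# The HDInsight PySpark Jupyter kernel also provides a SparkSession as `spark`.
flights_df = spark.createDataFrame(flights_clean, flight_schema)
flights_df.printSchema()
flights_df.show(5)
```
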
>>So, here is kind of a nicer view of that data, so that's the power
of DataFrames, and I went through the same steps for the weather data so
I’m going to skip it. I do want to kind of talk about the different aspects
of the weather dataset. So, there's wind speed, humidity, temperature in Celsius, and the same date and airport fields. So, that's how we're
going to merge the two datasets together. So same thing,
filtering the data. Here, I am going
to do a grouping of the weather data by the date and the airport, because that's the kind of format and aggregation we need for the data. So, here you can see it's
just a SQL query, right? So, I can just write
free-form SQL queries and still use the power of Spark, and Spark can still perform
those SQL queries on my data.>>I see, so it’s interesting. So if I’m more
comfortable with SQL, I can do that but if I like
the Spark syntax with Python, I can do that; either one is okay?>>Exactly.>>All right, makes sense. All right, cool.>>Exactly.>>Let's keep going.
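
For reference, a free-form Spark SQL aggregation of the kind described here might look like this; the view and column names are placeholders, and `weather_df` is assumed to be a weather DataFrame built the same way as the flight one.

```python
# Register the weather DataFrame as a temporary view and aggregate it by date and airport.
weather_df.createOrReplaceTempView("weather")

weather_agg = spark.sql("""
    SELECT WeatherDate,
           AirportCode,
           AVG(WindSpeed)   AS AvgWindSpeed,
           AVG(Humidity)    AS AvgHumidity,
           AVG(TempCelsius) AS AvgTempCelsius
    FROM weather
    GROUP BY WeatherDate, AirportCode
""")
weather_agg.show(5)
```
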
>>So, there's my weather DataFrame. All right. So now we are kind of at
the data exploration step. So, now we are just going to plot the different
relationships between the variables and
the number of delays. So, here, we try to see which carriers have the highest number of delays, and airports, same thing. And actually, I wanted to plot this: there are a lot of airports with fewer than 100 flights, so I wanted to filter out that noise to improve the accuracy of my model later on. So, now I'm actually kind of going back to
the pre-processing, so here’s an example of
that iterative process. So, I’m going to clean
my weather dataset. Now, that I know that I want to filter
out some of that noise, I’m going to go ahead
and do that.>>But this already does look
like the iterative process. We acquire the data,
we pre-process it. As we explore it, we might decide we want to
do some more pre-processing.>>Exactly.>>Okay.>>So now that we’ve
gone through that step, we want to merge the datasets like I said before, which is what this SQL statement is doing here.
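
A hedged sketch of that merge, joining the flights to the aggregated weather on date and airport; the names are placeholders.

```python
# Join the flight DataFrame to the aggregated weather data with a SQL statement.
flights_df.createOrReplaceTempView("flights")
weather_agg.createOrReplaceTempView("weather_by_airport")

joined_df = spark.sql("""
    SELECT f.*,
           w.AvgWindSpeed, w.AvgHumidity, w.AvgTempCelsius
    FROM flights f
    JOIN weather_by_airport w
      ON f.FlightDate = w.WeatherDate
     AND f.Origin = w.AirportCode
""")
```
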
Then, now I have this joined dataset. I've cleaned everything. I've gone through pre-processing, now I'm ready for the next step, which is training the model. So, there's a couple of steps
involved before you can actually run the algorithm
on your data. So, you need to split
the training and the testing data like
I mentioned before, which is what I did here,
usually your training dataset is larger.>>So it might be a case
where you’re training on 75 percent of your data
and assessing on 25.>>Yes.>>The remaining 25 percent. Okay.
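
A one-line sketch of that split, using the 75/25 ratio from the conversation; the seed is only there for reproducibility.

```python
# Split the joined dataset into training and test sets.
train_df, test_df = joined_df.randomSplit([0.75, 0.25], seed=42)
print(train_df.count(), test_df.count())
```
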
>>Then, you want to actually restructure the data in order to create feature vectors. So, a feature vector is just the format that
your data needs to be in for the algorithm to process it and
actually train on it. So features would be like the variables that
I mentioned, the time, the airports, the carriers, the weather and you end up with a label for
each feature vector. The label would be, hey, one, there is a delay; zero,
there is no delay. So, this pipeline that
I defined over here, I wanted to pause at this because I wanted to
talk about pipelines. So, pipelines are a really cool feature that's included in Spark. Basically, they allow you to
define different stages. In this line of code right here, we can see there’s all these stages that
I defined, right. What I’m going to do is, with this one line
of code where I just fit my data
into this pipeline, it’s going to go through
each of those stages.>>Okay.>>So, it just kind of
reduces that hassle.>>So, fit is the magical line of code for training the model?>>Yes.>>Okay.>>Exactly.
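
A hedged sketch of a feature pipeline along these lines, with placeholder stages and column names:

```python
# Define pipeline stages (index the categorical columns, assemble a feature vector),
# then fit the whole pipeline in one call and apply it to the training data.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

carrier_indexer = StringIndexer(inputCol="Carrier", outputCol="CarrierIndex")
origin_indexer = StringIndexer(inputCol="Origin", outputCol="OriginIndex")
assembler = VectorAssembler(
    inputCols=["CarrierIndex", "OriginIndex", "AvgWindSpeed", "AvgHumidity", "AvgTempCelsius"],
    outputCol="features",
)

feature_pipeline = Pipeline(stages=[carrier_indexer, origin_indexer, assembler])
feature_model = feature_pipeline.fit(train_df)       # fit runs every stage in order
featurized_train = feature_model.transform(train_df)
```
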
So, now I'm actually going to train the model, run the machine learning algorithm. Before that, I was just
creating the feature vectors, so same thing I’m actually going to have a pipeline as well. So, this is the
magical line of code, where I’m actually pulling
in a pre-built model. So, I’m going to use a
logistic regression model, this is all pre-built, comes with the PySpark
machine learning library.>>Right, the
logistic regression, that sounds complicated. Did you have to write
that all yourself?>>No, this one line of code right here is
all I had to do. So I just had to give it
here’s my features column, the name of the features column
and then I can give it some parameters and then right up right away I can just run
it through this pipeline.>>So, some data
scientists have written logistic regression, and then we can just pull that in.>>Exactly.>>All right, cool.>>You can feel free to tweak it and write
your own algorithms, that’s completely fine, but these have been proven to
have high accuracy.
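
A hedged sketch of pulling in that pre-built logistic regression model; the parameters, column names, and the 0/1 label column are assumptions.

```python
# Train the pre-built logistic regression model from the PySpark ML library.
from pyspark.ml.classification import LogisticRegression

# Assumes the data already carries a 0/1 "label" column (1 = delayed, 0 = on time).
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10, regParam=0.01)
lr_model = lr.fit(featurized_train)      # fit is where the training actually happens

# Apply the same feature steps to the test set, then score it.
predictions = lr_model.transform(feature_model.transform(test_df))
```
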
So, now that I've actually fit my model and I have a machine learning model now, I want to test it. So this is kind of
the same magic; the machine learning library comes with these classification metrics, which give you the accuracy. I mean, you could go through and map, hey, how many of these predictions are true, and then divide by the total and get your accuracy, but this is just a lot cleaner. So, we have our accuracy right over here.
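
For reference, a minimal sketch of that testing step with one of the built-in evaluators; column names are placeholders.

```python
# Score the predictions with a built-in accuracy metric.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
)
accuracy = evaluator.evaluate(predictions)
print("Test accuracy: {:.3f}".format(accuracy))
```
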
Now, that brings us to our last step, deploying the model. I just did want to mention, I did only train on one logistic regression model, but there are
so many others that you can bring in for binary classification as well
as multi-classification.>>I see. So, not only can
we iterate on the data, cleansing process but we can even iterate on
training the model. Trying different things out.>>Yes, exactly.>>Sometimes we think
of machine learning as this magic wand but a lot of times it’s just
some trial and error.>>It is.>>Getting things right.>>There have been times when I've run through like six or seven different
algorithms to get high accuracy.>>Great, cool.>>Yeah. So, now we’re at our last step
deploying the model. So as I mentioned before when I was going
through the slides, deploying the model could mean you wrap it up in a container. Now that we have that model that I just trained, we first need to actually save it. So it's this one magic line
of code right here, you’re saving
that whole pipeline.>>Okay.>>You save it directly
back to your Azure storage, as you can see it’s
that same address right here. Then, you can load it back in with another one line of code.
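
A hedged sketch of that save-and-reload step; the storage path is a placeholder, and a fitted Pipeline can be saved and loaded the same way.

```python
# Save the fitted model back to Azure Blob storage, then load it again in another session.
from pyspark.ml.classification import LogisticRegressionModel

model_path = "wasb://data@mystorageaccount.blob.core.windows.net/models/flight-delay-lr"
lr_model.write().overwrite().save(model_path)

# Later, from a different notebook or cluster:
restored_model = LogisticRegressionModel.load(model_path)
```
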
>>So, we can run it in Livy or, if we want, we could deploy it to a container.>>Or, you can open it in another Jupyter notebook in a different cluster, right?>>Okay, awesome.>>Yeah.>>All right. Well, this has
been a great discussion on machine learning in
Azure Government with HDInsight. This is Steve Michelotti from the Azure Government
Engineering Team, with Yujin Hong
talking about machine learning on Azure Government.
Thanks for watching.
