Hello and welcome to this
introduction to Data Science webinar.
My name is Doug Rose,
so I hope you enjoy this presentation,
this webinar on introduction
to data science.
So let's get started.
Data science is using a scientific
method to explore the data.
That's a really key concept to this entire
webinar, is it's using the scientific
method, you're exploring the data.
You're looking for things.
You don't really know
what you're going to find until you
start going through the data.
And that's the science
in data science.
It's about the scientific method.
So let's go through
the different types of data.
The first is structured data.
Structured data is the traditional kind;
it's one of the oldest data types.
Then there's semi-structured data,
which has some aspects of structured data,
but also some aspects of
unstructured data.
And third,
there's unstructured data.
So now let's figure out
what the differences are
between these three types.
Structured data has a clear,
predefined, detailed format, so
you'll know what's going
in and you'll know what to expect before
the data gets populated in your database
or data cluster or whatever. It has a specific
format and it comes in a specific order.
So that's all great.
But let's think about
how that breaks down.
So think about something like an address
book, which is a pretty good
example of structured data.
You have a really good sense of the kind
of data that's going to end
up in your address book.
You have a first name
and that's going to typically be text
and you have a last name
and that's also going to be text.
People don't typically have a last name
So, you know what's going to come in,
you know how it's going to be labeled
and you know what kind of format it's
going to be in and it's going
to come in a specific order.
So you're going to have like, you know,
typically people will type in their first
name and then they'll typically
type in their last name.
So it's going to have this order to it.
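To make the address book idea concrete, here is a minimal sketch in Python. The Contact class and the sample names are made up for illustration; the point is only that every record has the same labeled fields, of a known type, in the same order.

```python
from dataclasses import dataclass, fields


@dataclass
class Contact:
    # Every record has the same labeled fields, in the same order.
    first_name: str
    last_name: str


def column_names():
    """Return the field labels, which are known before any data arrives."""
    return [f.name for f in fields(Contact)]


# Hypothetical sample entries, just to show the shape of the data.
contacts = [Contact("Ada", "Lovelace"), Contact("Grace", "Hopper")]

print(column_names())
for contact in contacts:
    print(contact.first_name, contact.last_name)
```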
Another way to think about it,
something like a recipe book.
You know, again, you'll have something
that comes in a very specific order.
You'll have ingredients, and then
you'll have measurements like one cup or
two cups or something like that.
So you'll know what to expect, and then
there'll be the text of the recipe.
So you could think of this really is
structured data and there's a lot
of data out there that's structured.
If you think about credit reports,
if you think about, you know,
address books, things like that, a lot
of people work with structured data.
Almost everything that you
work with in a spreadsheet or something
like that is structured data.
You know, you have your columns,
you have your rows, you have your format.
You kind of know what's
going to go into it.
So there's a lot of data out there
that's structured,
but there's also a lot
of semi-structured data.
And it's gotten a big boost the last
couple of years because of
things like web logs,
where you have the data.
It has a logical flow and format,
but it's not user friendly.
So if you've ever seen a weblog or
something like that, it's got
the date, it's got the time.
But then pretty much anything that comes
after it could be
kind of unstructured, whatever.
You know, you could have a long text
that describes somebody logging in.
It could be useful.
It could be a ton of different stuff.
So with semi-structured data,
you've got some elements of structure like
the address book,
but you've also got a little bit
of unstructured data, you know,
some text or something like that.
And if you've ever worked on a Web server,
if you've ever had any logging for servers
or even on your computer, you'll see
that they can get huge really quickly.
So it's kind of tough
to get through it.
Not very many people can look
through a web log and immediately see
what might be wrong with the server.
So you'll need some sort of program or
something that turns it into a graph or
something pretty so that you can
kind of understand what it is.
So, again, that's the difference.
With structured data you have your
address book, where anybody can
look and understand it,
and then you have semi-structured data,
where you really have to take some effort
in understanding what's in the data,
get some sort of visualization
or something like that.
So you can see a chart and
understand what's in the log
or in this semi structured data.
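As a rough sketch of what that kind of program does, here is a small Python example. It assumes a made-up log format where each line starts with a date and a time and the rest is free text; the sample lines and function names are hypothetical. It pulls out the structured part and counts entries per day, which is the kind of summary you could then chart.

```python
import re
from collections import Counter

# Assumed format: "YYYY-MM-DD HH:MM:SS <anything at all>"
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$")


def parse_line(line):
    """Split a log line into its structured part and its free-text part."""
    match = LOG_LINE.match(line)
    if match is None:
        return None  # the line does not follow the assumed format
    date, time, message = match.groups()
    return {"date": date, "time": time, "message": message}


def entries_per_day(lines):
    """Count how many well-formed entries were logged on each date."""
    parsed = (parse_line(line) for line in lines)
    return Counter(entry["date"] for entry in parsed if entry is not None)


# Hypothetical sample data.
sample_log = [
    "2024-01-05 09:15:02 user alice logged in from 10.0.0.7",
    "2024-01-05 09:15:40 ERROR disk quota exceeded on /var",
    "2024-01-06 11:02:13 user bob logged out",
]

print(entries_per_day(sample_log))
```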
Now, the third type is unstructured data.
Unstructured data is typically called
schemaless.
We talked about the address book as
structured data: you have a first name
and a last name and an address. That is
typically called the database schema.
You understand what's
going to be in the database,
so you can label the columns and you
understand what's going to come
in there. With unstructured data,
you don't have this kind of schema.
It could be anything.
It could be just some random text.
It could be an image.
And so a very good example of unstructured
data is something like social media.
If you look at your Twitter feed or
something like that, you'll see that
there's some videos, there's some texts.
And the kind of text that people
use isn't very structured.
It could be a question.
It could be broken into two parts.
It could be filled with emojis, it could
be text and then a video or even audio.
So, you know, that's unstructured data.
And so you'll see there's
varying types of data,
audio, video, different types of video.
And it could come in different formats.
I mean, you could have an MP3
audio file or an M4A
audio file or something like that.
Or with video,
you could have MPEG-2.
It's unlikely, but you could have it. Or
MPEG-4. It could come
in different formats.
The text could be
formatted in a different way.
And so it's more difficult and sometimes
it could even be in different
languages or something like that.
So unstructured data is everywhere.
And with social media
and people uploading videos to YouTube
all the time, it's only increasing.
Now, one of the challenges with data
science is that 80 percent of the data
out there is unstructured data.
And this was the big trend a couple
of years ago in big data:
there were these massive amounts
of new data flooding
people's servers, and people were trying
to figure out what to do with it.
And Big Data is a little bit related
to data science, because a lot of data
science uses big data,
uses kind of unstructured data,
massive petabytes of unstructured data
and tries to get some value from it.
So something like YouTube is a really good
example of big data that's unstructured.
You have millions of people who are
uploading videos every day, with audio
and things like that, and they could have
different formats and different structure.
And the big challenge that Google has is
to try and figure out how
to categorize it, what to do with it.
And so that's one of the big
challenges for data science teams:
you have these three types of data,
structured, semi-structured,
and unstructured.
You have to figure out what to do with it.
How are you going to analyze it?
Remember, data science is about kind
of using a scientific method
to understand your data.
So you're going to run
experiments, you're going
to look at the data.
You're going to try and figure out
how to extract value from the data.
So to do that, you need to analyze it.
And one of the ways that you analyze it
is you can put it into
a database management system.
So there are different types of databases.
They came up in the 1960s,
a lot of it around the space program,
starting out with IBM's
Information Management System.
And then later on, there was
the relational model of databases.
Your organization probably has
a relational database, an RDBMS, a
relational database management system.
Maybe it's MySQL or Microsoft
SQL Server or something like that. SQL
stands for Structured Query Language.
This is one of the ways that you
extract data from these databases.
It's a language to run reports and to
join tables and things like
that, so that you can mix and match
your data and extract value from it.
In 2004,
Google came out with something called
Bigtable,
which was one of the big impetuses for big
data. Bigtable was later
replicated with HBase.
It was a way to put massive
amounts of unstructured data
into a management system.
And then later on,
you have Cassandra,
you probably have Cassandra in your
organization, and NoSQL,
which is a way to manage unstructured
data in a massive cluster.
So you can take
any data and put it into the cluster.
So most of these clusters are
working with unstructured data.
You'll have, you know,
clusters of servers that pull in video
and audio and try to manage it.
So these are the different
databases, different ways
to manage your data.
So you should be familiar
with them if you're going to work
with data science teams.
They go all the way back
to 1966, where you have
this kind of structured database with
IBM's Information Management System.
Then you have relational databases
that can take different tables,
almost like spreadsheets,
and link them together and run reports,
and you do that
with SQL, or Structured Query Language.
Then later on you have Bigtable,
which starts to work with unstructured
data, and Cassandra and NoSQL.
But when you're working with typical
databases, you know,
some of the oldest ones, you'll have
something that looks a lot like this.
You'll have a database table.
And this is very similar to the address
book, where you have a first name,
last name, date of birth.
You kind of know what to expect.
Social Security number, you know
that's going to be a number
and not something like
a word. Place of birth,
you typically know that's going to be text.
And date of birth,
you typically know the format.
And so this is very structured data.
So it's easy to manage.
Now, what relational
database management systems do is
take these different
spreadsheets or tables
and link them together so that you can
create reports. They do this
through Structured Query Language.
And so this is often what
it looks like: a very simple database
where you have a series of different
tables with structured data, and then you
link them together
as a way to manage your data,
as a way to create reports or views, or
to look at it and try to understand it.
So your data science team will most
likely be working with some relational
databases, some structured data,
and it will look a lot like this.
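Here is a small sketch of linking two tables and running a report with SQL, using Python's built-in sqlite3 module and an in-memory database. The table names, columns, and rows are all made up for illustration; your organization's database will look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical tables with a fixed schema, linked by person_id.
conn.executescript("""
CREATE TABLE people (
    id INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name TEXT
);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES people(id),
    amount REAL
);
INSERT INTO people VALUES (1, 'Ada', 'Lovelace'), (2, 'Grace', 'Hopper');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
""")

# A report: join the tables and total the order amounts per person.
report = conn.execute("""
SELECT p.first_name, p.last_name, SUM(o.amount)
FROM people AS p
JOIN orders AS o ON o.person_id = p.id
GROUP BY p.id
ORDER BY p.id
""").fetchall()

for row in report:
    print(row)
```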
And then data science teams will typically
also use NoSQL.
And so you might have Cassandra,
like I said, and these typically use
what are called key-value stores.
This came out in 1998.
A key-value store is basically
a way to create a key,
so a name for something, you could just
call it something like "video",
and then put a value in it.
But it could be anything.
These are schemaless.
Remember I talked about how
databases typically have schemas.
We have like first name, last name.
And so you kind of have to know what's
going to go into your database
before you start collecting data.
So imagine, you know,
in a relational database, remember,
we looked at all those different
tables that are connected together.
If you want to add some data,
let's say that you did first name,
last name and started collecting data.
And you're like, oh, my gosh,
we have to collect people's middle names.
Then you have to update the schema and
either go back and try to collect people's
middle names or accept the fact that some
of the people in your database are going
to have their middle names
and some people are not.
And so that was kind of a big
challenge with relational databases.
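A minimal sketch of that middle name problem, again using Python's sqlite3 module with a made-up people table. The schema has to be altered before a middle name can be stored, and the rows collected earlier are left with an empty value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO people VALUES ('Ada', 'Lovelace')")

# The schema has to change before a middle name can be stored at all.
conn.execute("ALTER TABLE people ADD COLUMN middle_name TEXT")

# New rows can carry a middle name (the new column comes last).
conn.execute("INSERT INTO people VALUES ('Grace', 'Hopper', 'Brewster')")

rows = conn.execute(
    "SELECT first_name, middle_name, last_name FROM people ORDER BY rowid"
).fetchall()

# The row collected before the change has no middle name (None).
for row in rows:
    print(row)
```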
NoSQL, or "not only SQL", doesn't have this
challenge as much because it's schemaless.
It just uses these key-value pairs.
There's kind of a key and then
there's a value attached to it.
And so you can kind of expand
your data as much as you need to.
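As a toy illustration of the key-value idea, here is a sketch that uses a plain Python dictionary as a stand-in for a real store. The put and get helpers and the keys are hypothetical; a real NoSQL system has its own API. Notice that nothing has to be declared up front, so one record can have a middle name and another can be raw bytes.

```python
# A plain dict standing in for a key-value store (illustration only).
store = {}


def put(key, value):
    """Store any value under a key; no schema is declared anywhere."""
    store[key] = value


def get(key, default=None):
    """Look a value up by its key."""
    return store.get(key, default)


# Records with different shapes can sit side by side.
put("person:1", {"first_name": "Ada", "last_name": "Lovelace"})
put("person:2", {"first_name": "Grace", "middle_name": "Brewster",
                 "last_name": "Hopper"})
put("video:42", b"\x00\x01 raw bytes of some video")

print(get("person:2")["middle_name"])
print(type(get("video:42")).__name__)
```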
And if you're collecting massive
amounts of data, one of the advantages
of NoSQL
is that you can
continue to add servers.
One of the big challenges
with relational database management
systems is that they don't scale as well
for these massive data collections,
because, as you saw,
you have to connect
these different tables together.
And so as you start to collect massive
amounts of data, you have a real challenge
in scaling that out. With NoSQL,
you have something called horizontal
scaling where you just kind of throw more
servers in and then it's sort of
cross replicates data among
the different servers.
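One simple way to picture horizontal scaling is to hash each key and use the hash to pick a server. This is only a sketch under that assumption, with made-up server names; real clusters also replicate data and use more careful placement schemes than a plain modulo.

```python
import hashlib


def server_for(key, servers):
    """Pick a server for a key by hashing the key (simplified placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def distribute(keys, servers):
    """Group keys by the server each one would be stored on."""
    placement = {name: [] for name in servers}
    for key in keys:
        placement[server_for(key, servers)].append(key)
    return placement


# Hypothetical keys and server names.
keys = [f"video:{n}" for n in range(20)]
three_servers = ["server-a", "server-b", "server-c"]
four_servers = three_servers + ["server-d"]  # "throw another server in"

print({name: len(found) for name, found in distribute(keys, three_servers).items()})
print({name: len(found) for name, found in distribute(keys, four_servers).items()})
```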
So NoSQL scales, and
it's more flexible because you
just have these key-value stores.
You don't have to deal with database
schemas like you do
with relational databases.
It doesn't work as well for structured
data because if you know what's going
to go into your database,
there's not really that much of a need
for using a NoSQL cluster.
But if you're going to do something like
collect video, audio, text,
or if you want to collect social media
streams or something like that,
then NoSQL is really
kind of the only way to go.
It's really difficult to do
that with a relational database because you
have to create a schema.
So you have to know all the data that's
going to go into it before
you start collecting data.
So, you know, NoSQL has been
a really big thing for data science.
It was also a really big thing for big
data because it gives these data science
teams a lot more room to explore.
You can collect a lot more data
and you can run some really
interesting reports.
Something people consider a downside
is that, since it's such massive data,
a lot of times NoSQL clusters aren't
really good for quick transactions.
You wouldn't want your bank to use
something like NoSQL because it's got
something called eventual consistency,
which means that if you were using this
for your bank and you withdrew money,
it wouldn't instantly show
the update in your database.
It would eventually
make it consistent.
So it takes a little bit longer.
So a lot of banks,
a lot of financial institutions still use
relational databases, for a couple of reasons.
One is because almost all the data
they receive is structured.
It's your name, it's the
amount in your account.
It's how much you withdrew,
how much you deposited.
But the other reason is that
NoSQL is a little bit slower
to show an update, and you want
to have this immediate transaction
when you're working with banks
and things like that.
So now you've seen the types
of data, and you've seen the different ways
that data can be stored in a database.
If you're working with a data science
team, you're going to want to be familiar
with the different kinds of servers
and technology that your team is
going to be using.
And if you're working on the team,
you obviously want to be familiar
with some of this stuff.
So now let's think a little
bit about the science.
What's the science part of data science?
The main tool you're going to be
using for the science on data
science teams is statistics.
And there's an old Mark Twain quote
that there are three kinds of lies:
lies, damned lies, and statistics.
And, you know, what he meant by this is
that people tend to think
of statistics as pure math,
and so a lot of times you can
fudge things with statistics
and make it look like the truth.
And so you have to watch out
for that with your data science team
and we'll see a little bit
more of that later on.
But the definition of statistics is
the science that deals with the collection,
analysis, and interpretation of numerical
data, often using probability theory.
So, you know,
you'll see a little bit about
probability theory later.
But, you know, and you've seen a little
bit about the collection
and analysis of data.
Now we're going to think a little
bit about the interpretation.
Now, we're always using
statistics; we just don't
really think about it.
You know, my son is in school, and so
we're always watching his
GPA, and that's a statistic.
You know, people who are sports fans
are always working with statistics.
You know, a lot of times
the statistics give us comfort.
You know, my wife doesn't like to fly very
much, especially if there's turbulence.
And so I try to remind her
that statistically this is really the
safest way to get to where we want to go.
It doesn't help.
But, you know, that's a way to kind
of use statistics to think about it.
And so people use
statistics all the time.
If you're in the political season
right now, in whatever
state or country you're in,
then you'll see lots of statistics,
you know, jobs reports.
Some politician will say, you know,
our unemployment went down, and then
a competing politician will say,
yes, but income also went down.
And so people use statistics to try and
kind of tell their version of the truth.
And this can sometimes
be a dangerous thing.
But one of the most common kinds of statistics
that you'll see, even with your data
science team, is something
called descriptive statistics.
And with descriptive statistics,
the two tools that you'll see
the most are
the median and the mean.
They're trying to describe
large groups of numbers,
and so that's why it's called
descriptive statistics.
And that median-versus-mean challenge is
one that you'll see a lot
on data science teams.
In fact, you can take that same group
of numbers and it might mean something
different to you based on whether
you use the median or the mean.
The mean is what you can
think of as the average. You take the group
of numbers, you add them up,
and then divide
by the number of values that you have.
And you come up with
an average, something that
generally describes the whole group.
And a GPA is a really good example
of a descriptive statistic:
they take all my grades,
add them all up, and then
divide by the number of grades.
And so I can get a sense of where
my grades fall overall.
Now, the median is the number that's
in the middle of a distribution.
So you'll have,
you know, five numbers or something like
that, you line them up in order,
and it's the one in the middle.
And you can get a different sense.
If you're trying
to describe these five numbers,
you can get a different answer based
on whether you're using
the median or the mean.
So a good example:
sometimes I go down to Florida,
where my parents have a place.
One of the things they
have in this little place where my parents
live is a little country club,
and there's a little bar
in the country club.
So say there were three stools
at the bar and three people were
sitting on those stools.
Let's say that one person made
fifty thousand dollars a year.
Another person made one hundred
thousand dollars a year.
And then the third person made like
a billion dollars a year,
something like that.
So if you think about that in terms
of the mean, if I took the mean of those
numbers, so I added them all up
and then divided by three
because there are three people sitting
on the stools, then
the mean income of those three
people is about three hundred
and thirty-three million dollars.
You would think
to yourself, oh, my gosh,
these three people are incredibly wealthy,
all of them make,
you know, roughly three hundred
and thirty-three million dollars,
even though
one person is making fifty thousand
dollars a year, which is not bad,
but wouldn't be
considered enormously wealthy, nowhere near
three hundred and thirty-three
million dollars.
And the person in the middle is making one
hundred thousand dollars a year,
you know, a good management salary.
But again, three hundred and thirty-three
million dollars is a lot more, if you
assume that person makes that much.
It's really that third person,
who makes a billion dollars,
who's skewing the results.
So if you see this kind of bell curve
type shape that I'm showing you,
you'll see that the
person who made a billion dollars
skewed the results
to the right a little bit.
They skewed it over so that everybody
seems like they're much
wealthier, because you have this
billionaire sitting on the third stool.
So if I took these same three people
and then I tried to look at the median,
then I would say, OK, one person
makes fifty thousand dollars.
One person makes one hundred thousand
dollars and one person
makes a billion dollars.
I line them all up and then the median
income is one hundred thousand dollars.
So, you know, that, again,
is descriptive of the three people.
But, you know,
if I found out that the third person was
a billionaire, I'd be like, OK, well,
that doesn't really describe them.
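Here is the bar stool example worked out in Python with the standard statistics module. The incomes are the three from the example; the variable names are just for illustration.

```python
from statistics import mean, median

# The three people at the bar: $50,000, $100,000, and $1 billion a year.
incomes = [50_000, 100_000, 1_000_000_000]

mean_income = mean(incomes)      # add them up, divide by three
median_income = median(incomes)  # line them up, take the middle one

print(f"mean:   ${mean_income:,.0f}")
print(f"median: ${median_income:,.0f}")
```

The mean comes out at roughly 333 million dollars and the median at one hundred thousand, so the same three numbers give two very different descriptions.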
So when you think about
the median and the mean,
when you're working with data science
teams, you have to think about which
better describes the group of numbers.
In this case, you have three people
and their mean income is three hundred
and thirty-three million dollars.
That seems kind of misleading because
you have one person who makes one hundred
thousand and one person
who makes fifty thousand.
Now, again,
is it a better description to talk
about the median? Then you have
one hundred thousand dollars,
but you've got a billionaire
in the room, and you're like, well, that's way
more than one hundred thousand dollars.
So you have to think about
which one's more descriptive.
When you're working with
your data science team,
work with them to figure
out which one might tell a better story.
And you see this a lot
with politicians. You'll see
one politician who,
if they give a massive tax cut to
the wealthiest people, will say,
OK, well, our mean income went up.
Now everybody makes three hundred
and thirty-three million dollars.
And then someone who competes against them
might say, well, actually, you just gave a
huge tax cut to this billionaire.
The median income hasn't moved at all.
It's still a hundred thousand dollars.
So you'll see politicians and other
people play with this kind
of descriptive statistic to try
and tell a better story.
So when you're working with a data science
team, you have to understand this
challenge with the median and the mean.
Another thing that you'll probably be
working with, remember, from the
definition of statistics, is probability.
Probability is the likelihood
that something will happen,
and it ranges from high to low.
And one of the challenges that a lot
of people have with probability is
understanding that probability
is an expression of uncertainty.
And again, you see this a lot in politics
as well, especially around polling.
So if one politician has
a 70 percent chance of winning,
then people think to themselves, oh, well,
that politician is going to win because
they have a 70 percent chance of winning.
But that's not really what
probability is saying.
Probability is saying that it's
70 percent likely they will win.
That means that when they did
the polling and they did their modeling,
70 out of 100 times
they won, but 30 out
of 100 times they didn't win.
So it's certainly possible
that they won't win.
And so a lot of people think of, well,
70 percent then it's done, you know,
game over because it's 70 percent chance.
And they don't really think that.
OK, well, this is really just
70 out of a hundred times.
This was the outcome that would happen.
But 30 out of 100 times
this outcome didn't happen.
And so that's kind of a different
way to think about it.
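One way to get a feel for this is to simulate it. The sketch below is only an illustration, not how any real polling model works: it treats each simulated election as an independent draw with a 70 percent chance of a win, using Python's random module with a fixed seed so the run is repeatable.

```python
import random


def simulate_elections(win_probability, trials, seed=0):
    """Run many simulated elections and count wins and losses."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    wins = sum(1 for _ in range(trials) if rng.random() < win_probability)
    return wins, trials - wins


wins, losses = simulate_elections(0.70, 10_000)
print(f"won {wins} times, lost {losses} times")
```

The candidate wins most of the simulated elections, but still loses in roughly three out of every ten.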
So when you're working with your data
science team and they start to talk
to you about probability,
don't just assume that if it's like 80 or
90 percent, you don't even
have to worry about it, because, you know,
80 out of 100 times this happened,
but 20 out of 100 times it didn't.
And you see this also with hurricanes
and things like that, where
they say, well, you know,
there's a 70 percent
probability that the hurricane
will make landfall, and people say,
oh, my gosh,
it's certain it's going to hit.
And then when it doesn't, or it goes
somewhere else, they're like,
well, you were wrong.
But the statistician wasn't wrong,
or the data science team wasn't wrong.
They were just saying that 70
out of 100 times it happened,
but 30 out of a hundred times it didn't.
And again, you see this a lot
with polling, people will say, well,
you know, so-and-so had a fifty
six percent chance of winning.
So you said they were going
to win and they didn't.
And so a lot of people have trouble
with that, with probability.
When you work with your data science team,
keep in mind that if there's
a fifty-six percent probability, it
doesn't mean that it's going to happen.
It means that fifty-six out of 100 times
it happened, but forty-four
out of 100 times it didn't.
So you don't really know what
the outcome is going to be.
It's only marginally more likely that
one thing will happen than the other.
So remember not to get
confused by probability.
I've worked with a lot of teams, and when they
explain probability to other
people in the organization,
they have a big challenge with this.
People say, well,
you said there was a sixty-seven
percent chance this was going to happen,
so I never thought about it again.
And the team has to reinforce: well, no,
I said that sixty-seven out of 100 times
it would happen,
but the other times it wouldn't.
So probability can be tricky.
But think about that with
your data science team.
Another thing that data science teams work
a lot with is correlation. This is when
variables correlate with one another.
You'll see this
with movie recommendations, or
when pharmaceutical companies
are testing drugs.
It's when you have a bunch of data points
and you want to see whether they're
clustered together.
Now, there could be sort of what's called
a positive correlation,
where you'll see in the line here
that everything is kind of
in a pretty much a straight line.
So you'll have data points
that are clustered together.
And there can also be a negative
correlation where all the data points are
clustered together, working
in the opposite direction.
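One common way to put a number on this is the Pearson correlation coefficient, which runs from -1 (a perfect negative correlation) to +1 (a perfect positive correlation). That measure is my choice for this sketch, not something from the slides, and the data points below are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 8, 10]    # goes up as xs goes up
falling = [10, 8, 5, 4, 2]   # goes down as xs goes up

print(round(pearson(xs, rising), 2))   # close to +1
print(round(pearson(xs, falling), 2))  # close to -1
```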
So think about movie recommendations.
I was on an airplane recently, and
I ended up watching an old movie that I'd
never seen before
called Breakfast at Tiffany's, with
Audrey Hepburn. It was good.
And so if I went to Netflix and I watched
Breakfast at Tiffany's,
then they would look at the other data
points, the other people who watched that as
well, and see other movies that they also
watched, and then try to correlate
the data points together.
So if someone like me watches
Breakfast at Tiffany's, and
someone else also watched Breakfast
at Tiffany's and also ended up
liking a movie called Serendipity,
then there's a pretty good chance that I
would end up liking that movie too.
So you make that sort
of positive correlation.
Now, if that person who watched Breakfast
at Tiffany's also ended up
downloading Armageddon and watching it
but hated it,
then you could make a negative
correlation between
people who liked Breakfast at Tiffany's
and people who liked watching, you know,
Armageddon or something like that.
And so your data science teams
will look to create these
correlations so that you can predict
people's behavior, or so that you can just
group the data together.
Now, one of the big
challenges that you'll see with data
science teams, and it's similar
to probability, is that correlation
doesn't necessarily imply causation.
I remember reading an example about
this. There was an organization
that was trying to figure out how to get
people to be healthier at work.
So they gave people
these little IoT devices,
Internet of Things
devices, that can track people
in the office, to see,
you know, if they got up and moved around
more, whether their health care
costs would be lower.
And what they found is that people who got
up every hour or so from work and went
outside for five minutes
ended up having much higher health costs.
And so, you know, that kind of seems
like a counterintuitive result.
You would assume that someone who moves
around a little bit more and goes outside
would, you know,
have a healthier outcome.
And so they looked at the data
and they said, OK, well,
I guess people shouldn't move as much.
They read the correlation as if people
moving and going outside for five minutes
caused their
health care costs to go up.
But you might recognize this
if you've worked in an office:
a lot of people who get up every
hour and go outside for five minutes
are actually smokers.
And so if you're getting up
for five minutes and going out and smoking
and coming back,
the smoking is outweighing the health
benefits of going outside
for five minutes.
So people getting
up for five minutes and going outside
wasn't the cause of
their health care costs going up.
There was a different reason.
It was because they were smoking.
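Here is a toy version of that story in Python. The numbers are completely made up to show the effect; they are not from the study. Taken all together, the people who take breaks have higher average costs, but once you split the group into smokers and non-smokers, the people who take breaks have lower costs within each group.

```python
from statistics import mean

# (takes_breaks, smoker, annual_health_cost) -- invented numbers
employees = [
    (False, False, 1000),
    (False, False, 1100),
    (True, False, 900),
    (False, True, 3100),
    (True, True, 3000),
    (True, True, 2900),
    (True, True, 3000),
]


def average_cost(rows, takes_breaks, smoker=None):
    """Average cost for a group; smoker=None means ignore smoking status."""
    costs = [
        cost
        for breaks, smokes, cost in rows
        if breaks == takes_breaks and (smoker is None or smokes == smoker)
    ]
    return mean(costs)


print("everyone:   ", average_cost(employees, True), "vs",
      average_cost(employees, False))
print("non-smokers:", average_cost(employees, True, smoker=False), "vs",
      average_cost(employees, False, smoker=False))
print("smokers:    ", average_cost(employees, True, smoker=True), "vs",
      average_cost(employees, False, smoker=True))
```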
And you see that a lot.
You know, there was a huge health care
crisis a couple of years ago where people
thought that
taking estrogen shots was beneficial
to women's health because they saw
a correlation between estrogen shots and
better health outcomes for women.
And so women took these estrogen
shots, and they actually turned out
to be very unhealthy for women.
The explanation for the correlation
was that the women who got estrogen shots
were more likely to have higher incomes,
and that's why they were better off.
But the estrogen shot itself
was actually harmful to people.
And you see correlation and
causation all the time. You know,
a lot of people
say that there's a correlation between
the temperature going up
and crime rates going up.
And so the cause
might be that when people get
hotter, they are more
inclined to commit crime.
But it's probably more likely that since
it's warmer, people are more
likely to be outside.
So you have to watch out for
correlation and for making what are called
spurious causation arguments, where
you're
attaching the wrong cause to it.
When you're working with your data science
team, watch out for this
correlation and causation challenge.
Now, one thing that you'll see a lot
with your data science team is that
data science
teams work a lot with something
called predictive analytics.
This combines a lot of the
statistical techniques that we've
already talked about, probability,
correlation, things like that,
and, remember, descriptive
statistics, and tries to predict
what people will do in the future.
So obviously, this is of a lot of value
to to to many large organizations.
If you can predict when your customer is
going to buy your product,
you can have it there on time.
If you can predict when your customer is
more likely to buy, you can
focus more on advertising.
So a lot of the emphasis
the last few years with data science teams
has been focused on predictive analytics,
sort of trying to use these different
statistical techniques to predict people's
future behavior so that you can kind
of meet them where they're going to be.
So when you're working with your data
science team, if you're on a data science
team, you should be familiar
with this term predictive analytics.
And you should think of it as these
these different statistical techniques
combined together to try to predict
customers' future behavior,
anyone's future behavior.
You know, if you predict the weather,
you could predict whether or not people
are going to visit your website.
If you do a certain thing
when you're doing this correlation,
if you could predict when people are more
likely to watch a different movie,
then you can, you know,
obviously buy some licensing
for that movie so that you make sure that
people can watch it without any problems.
So predictive analytics is a big thing.
And it's been big
with data science teams.
So you should be familiar with that term.
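As a rough illustration of the idea, here is about the simplest predictor
you could write in Python: it looks at which weekdays past orders came in
on and uses those frequencies as a guess about the future. The order log
and the function name are hypothetical, and real predictive analytics would
combine many more signals than this.

```python
from collections import Counter

# Hypothetical purchase log: the weekday of each past order (made-up data).
past_orders = ["Fri", "Sat", "Fri", "Mon", "Sat", "Fri", "Sun", "Fri", "Sat", "Wed"]

def purchase_probabilities(orders):
    """Estimate the chance an order falls on each weekday from past frequencies."""
    counts = Counter(orders)
    total = len(orders)
    return {day: count / total for day, count in counts.items()}

probs = purchase_probabilities(past_orders)

# The weekday with the highest estimated probability of a purchase.
best_day = max(probs, key=probs.get)

print(probs)
print("advertise most heavily on:", best_day)
```

The point is only that past behavior is turned into a probability, and the
probability is turned into a decision about where to focus.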
Finally, I want to go a little bit
over regression analysis,
you'll see this also with data science
teams, and this is when you have sort
of two variables and you try to
see if there's a relationship
between those variables.
One of the classic examples is that,
you know, taller people end up being
heavier because they're taller.
So those two variables, the height
and the weight are closely connected.
And you can see sort of a regression
there with the straight line.
And so you'll see these
kind of regression tables.
People try to do regression
analysis all the time.
When you're working
on a data science team,
you might try to make a connection between how much
a customer spends and how much time they're
on the website and try to make
sort of a clear regression.
So I'm not going to go too deep
into regression because it
can get a little tricky.
But I want you at least to be familiar
with what regression analysis is:
the connection between two
variables, a dependent and an independent
variable, and trying to show
the trend line between those.
And so if you looked at this
regression table, you would assume
that as someone gets taller,
they would also tend to be
heavier, they would weigh more.
So and then you can kind of try to figure
out how close it is
to the trend line there.
That's what that line going up is.
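If you want to see how a trend line like that is actually computed, here is
a small sketch in Python using ordinary least squares. The heights and
weights are made-up numbers for illustration, and the function names are
my own. Height is the independent variable and weight is the dependent one.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in y that the trend line explains (0 to 1)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up heights (cm) and weights (kg), for illustration only.
heights = [150, 160, 165, 170, 175, 180, 185, 190]
weights = [52, 58, 63, 66, 72, 75, 82, 88]

slope, intercept = fit_line(heights, weights)
fit = r_squared(heights, weights, slope, intercept)

print(f"weight = {slope:.2f} * height + {intercept:.2f}, R^2 = {fit:.3f}")
print(f"predicted weight at 172 cm: {slope * 172 + intercept:.1f} kg")
```

The R squared value is one way to say how close the points are to the
trend line: the nearer it is to 1, the tighter the points sit around it.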
The final thing I want to go over before
we start talking a little bit about teams
is the difference between
samples and populations.
The big push around big data has been
around sort of being able to work
with massive sample sizes.
When you're working with an organization,
if you work in a data science team,
in an organization, a lot of times you
will have sort of a limited set of data.
You might not have all the data on what
movies people are watching or,
you know, you might not have all
the data about what their income is.
And so you can take a sample of that data,
find a small group of people
that you think are going to be
representative of that larger group and do
some analysis based on that sample.
Now, one of the big things about big data
is that as you're able to collect these massive
data sets, you can start
to work with whole populations.
So think about polling data
when there's an election coming up.
Typically, people don't go
out and talk to every voter.
What they'll do is they'll find a sample
that's representative of how they
think the voters in a certain city
or district or state will vote.
So I'm here in Atlanta, Georgia.
So if there's an election for mayor or
something like that, they don't
poll everybody in the city.
What they'll do is they'll find, you know,
a thousand people, five thousand people,
fifty thousand people in the city,
and they'll use that as a sample
and then do some statistical analysis
on them and then make generalizations
about the entire city.
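Here is a minimal sketch of that polling idea in Python. The city is
simulated: the 500,000 voters and the 54 percent support level are
assumptions I picked for the example. It polls a random sample of a
thousand voters and compares the estimate to the whole population, using
the usual approximate 95 percent margin of error for a proportion.

```python
import random

random.seed(42)  # fixed seed so the made-up electorate is repeatable

# Hypothetical city of 500,000 voters; True means "votes for candidate A".
population = [random.random() < 0.54 for _ in range(500_000)]
true_share = sum(population) / len(population)

# Poll a simple random sample of 1,000 voters instead of the whole city.
sample = random.sample(population, 1000)
estimate = sum(sample) / len(sample)

# Approximate 95% margin of error for a proportion: 1.96 * sqrt(p(1-p)/n).
margin = 1.96 * (estimate * (1 - estimate) / len(sample)) ** 0.5

print(f"whole population: {true_share:.3f}")
print(f"sample of 1,000:  {estimate:.3f} +/- {margin:.3f}")
```

A thousand people gets you within a few percentage points, which is why
pollsters can generalize without talking to every voter.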
Now with big data,
you're able to sort of collect massive
amounts of data, and you can work
with entire populations sometimes.
And so, you know,
if you're working with social media or
something like that and you're taking
in Facebook information,
Facebook data,
unstructured data from Facebook,
then you might be able to deal with like
a million people or two million people, or
just the entire population of a group.
So you should think about that when you're
working with a data science team,
the difference between samples, taking a small
group, and populations, where you're working
with an entire group and doing
OK, so we've talked a little bit about
how to sort of organize your data,
how you can work with that data
in databases. Then we talked a little bit
about statistics as a way
to kind of analyze that data.
So now I want you to think a little bit
about how you can change your organization
to take advantage of some of this data,
take advantage of the analysis to sort
of change your organization so that you
can do the science and data science,
run these experiments.
Now, a lot of organizations aren't
structured very well for the scientific
aspect of data science, you know,
doing sort of little small
running these little small tests
and experiments, having these teams.
And so you should kind of try
to start to change your
organization to take advantage
of sort of the empirical or
the questioning approach that you need to
have to do the scientific method,
to run these experiments on your data.
And one of the best ways I've seen to do
that is to kind of start having these
question only meetings about your data,
where you're sitting around in a meeting
and you're trying to sort of ask
questions about your data.
No one in the meeting
should make any statements.
And you should think about the questions
that you want to ask about your data.
Don't try to answer that
question in the meeting.
Just look for interesting questions. So,
you know, if you have a massive amount
of data about people's credit card transactions,
maybe ask an interesting question like,
well, what can we learn about our customer
through these credit card transactions?
Can we learn if they're happy?
Can we learn if they're
going to spend more?
Can we learn if they're going
to use a certain promotion?
So if you ask questions like that in these
question only meetings, it'll help you.
It's almost like a hypothesis when you're
using the scientific method where you're
asking questions and trying
to do something interesting.
So first thing you should do at your
organization is start having these
kind of question only meetings.
Now, one of the best ways
to take these questions and also to deal
with these questions
in your organization is to have
something called a question board.
And this is where you sort of organize
your questions in a way
so that you can make people feel
that the questions that they ask in these
meetings are going to get results.
So you get these questions,
you put them up on a board,
and then you organize these
questions into different groups.
And then you let
the data analysts from your data science
teams pull these questions off the board
and see if they can do a little bit,
run some experiments off them,
see if they can get some results from
them, do something interesting with them.
So, you know,
this is a really good way to organize your
questions, and it's also a really good way
to have people feel like when they're
asking questions in their
question only meeting, it's
going to lead to something.
So they're not just asking questions
for the sake of asking questions.
Now, one of the things that you want to do
when you're asking questions and putting
them on the board,
or pulling the questions off
the board with your data science team, is
when people are asking these questions,
try not to think about the data.
Don't worry if you have the data to answer
the questions and try to stay away
from yes or no questions, you just want
to focus on something interesting.
Try to think about second order questions,
something like, you know,
instead of thinking about why are
customers spending less in July, think
about why would a customer spend less.
Think about sort of like these bigger
questions, like what do we
know about how a customer spends
or why a customer spends?
So these are sort of bigger questions
that will let your data science team run
some small experiments and try to come
up with some interesting results.
There's an old joke in statistics
that you shouldn't use
statistics the same way that someone
who's drunk uses a lamppost.
You shouldn't use statistics for support
of something you already believe.
Instead, you should use it
to find something new.
So question only meetings help you sort
of ask these questions and it helps you,
helps your organization,
your team figure out what they don't know
and figuring out what you don't know is
one of the most productive ways to have
your data science team sort of run
experiments and do something interesting.
Now, your data science team,
what I've seen that works the best is sort
of a small team based a lot
on the scientific method.
Instead of going out and trying to find
a data scientist and paying two hundred or
eight hundred thousand dollars a year
for kind of a unicorn data scientist,
instead you should create a small team,
sort of like a small team
that uses the scientific method where
they're inventing and asking
What I've seen that works well is kind
of a data analyst,
someone just out of grad school that knows
statistics, knows how to run reports,
knows how to work with the data.
Then a role that I call a knowledge
explorer or some organization is called
which is someone who runs the question
meetings, is really interested
in getting good questions.
The data analyst
sort of really interested in using
the scientific method to ask something
interesting and then sort of a servant
leader role, which is someone who's
responsible for kind of getting some
of the results of the experiments out
to the rest of the organization
and to do some more practical things like make
sure that the data analyst has access
to database servers,
can migrate data from one server
to the next, that type of thing.
So these very small teams I found have
done a really good job at asking
doing things that are interesting with
the data. And these small teams,
I've found, work better than just trying
to sort of put all of your money in one
person spending two or eight hundred
thousand dollars a year finding a really
high end data scientist that
that could try to do all these things,
instead, have a small team of three people
and have them do interesting
things with the data.
So, OK, we're pretty much out of time now.
I hope you enjoyed this webinar,
but I want you to have
these five takeaways from
the time we spent together.
The first is your data science will depend
on the types of data
that you're collecting.
So you should understand
the different types.
You should understand the structured,
semi structured, unstructured data.
There's many different
ways to manage your data.
Remember, we talked about the relational
database management systems,
the old IBM systems, the
NoSQL clusters, Cassandra.
So there's different ways to manage your
data and you kind of should be familiar
with all of them and the different
pluses and minuses of using each.
You can use statistics
to explore your data.
That's really key.
Remember, descriptive statistics,
probability, correlation, things like that.
Remember, correlation doesn't imply
causation and then use statistics
to kind of explore your data.
And then you should kind of
change your organizational structure
to make guesses and run
experiments about your data
so that you can kind
of get more out of it.
And so you want to change your
organization: instead of focusing
on things you know and proving things
you know, be more scientific,
more exploratory and make
guesses and run experiments.
And finally, I want you to sort of think
about creating these teams,
these small teams of people that can get
your organization comfortable with asking
questions. A lot of organizations aren't
very comfortable asking questions.
And, you know, changing
the structure is one of the key parts
of having a successful data science team.
So I hope you enjoyed this presentation
and good luck structuring your data
science team and finding out
interesting things about your data.