See what it takes to succeed on a Data Science Team

Hello and welcome to this Introduction to Data Science webinar. My name is Doug Rose, and I hope you enjoy this presentation. So let's get started.
Data science is using the scientific method to explore data. That's the key concept for this entire webinar: you're using the scientific method, and you're exploring the data. You're looking for things, and you don't really know what you're going to find until you start going through the data. That's the science in data science.
So let's go through the different types of data. The first is structured data, which is traditional data and one of the oldest data types. Then there's semi-structured data, which has some aspects of structured data but also some aspects of the third type, unstructured data. So now let's figure out the differences between these three types.
Structured data has a clear, predefined, detailed format. You know what's going in, and you know what to expect before the data gets populated in your database or data cluster. It has a specific format and it comes in a specific order. So that's all great, but let's think about how that breaks down. Think about something like an address book, which is a pretty good example of structured data. You have a really good sense of the kind of data that's going to end up in your address book. You have a first name, and that's typically going to be text, and you have a last name, and that's also going to be text. People don't typically have a last name like 12345. So you know what's going to come in, you know how it's going to be labeled, you know what format it's going to be in, and it's going to come in a specific order.
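To make the address book idea concrete, here is a minimal Python sketch of a record with a predefined format. The class name `AddressBookEntry`, the helper `schema_problems`, and the sample names are all hypothetical, chosen only for illustration; the point is that the fields, their order, and their types are known before any data arrives.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AddressBookEntry:
    # The "schema": field names, order, and types are fixed up front.
    first_name: str
    last_name: str


def schema_problems(entry):
    """Return a list of ways the entry departs from the expected format."""
    problems = []
    for field in fields(entry):
        value = getattr(entry, field.name)
        if not isinstance(value, str):
            problems.append(f"{field.name}: expected text")
        elif not value.strip() or value.strip().isdigit():
            # Names are expected to be text, not something like 12345.
            problems.append(f"{field.name}: does not look like a name")
    return problems


good = AddressBookEntry("Ada", "Lovelace")
bad = AddressBookEntry("Ada", "12345")
column_order = [field.name for field in fields(good)]
```

Because the format is known in advance, a surprising value such as a numeric last name can be flagged the moment it shows up.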
So typically people will type in their first name and then their last name, so the data has this order to it. Another way to think about it is something like a recipe book. Again, you have something that comes in a very specific order. You have ingredients, then measurements like one cup or two cups, and then the text of the recipe. So you know what to expect, and you can think of this as structured data.
There's a lot of data out there that's structured. Think about credit reports or address books. A lot of people work with structured data, and almost everything you work with in a spreadsheet is structured data. You have your columns, you have your rows, you have your format, and you know what's going to go into it.
But there's also a lot of semi-structured data. It's gotten a big boost in the last couple of years because of things like web logs, where the data has a logical flow and format but isn't user friendly. If you've ever seen a web log, it's got the date and it's got the time, but pretty much anything that comes after that could be unstructured. You could have a long text that describes somebody logging in. It could be useful, and it could be a ton of different stuff. So with semi-structured data you've got some elements of structure, like the address book, but you've also got a little bit of unstructured data, like free text. And if you've ever worked on a web server, or had any logging for servers or even on your own computer, you'll know that logs can get huge really quickly, so it's tough to get through them.
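As a rough illustration of semi-structured data, the sketch below parses a few made-up log lines. The log format (date, time, then free text), the sample lines, and the names `LOG_LINE` and `parse_line` are assumptions for this example, not any particular server's format. The date and time are structured and can be parsed reliably; everything after them is kept as opaque text.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout: "YYYY-MM-DD HH:MM:SS <anything at all>"
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_line(line):
    """Split a log line into a timestamp and a free-text message."""
    match = LOG_LINE.match(line)
    if match is None:
        return None  # the line does not follow the assumed layout
    stamp, message = match.groups()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), message


sample_log = [
    "2024-03-01 09:15:02 user alice logged in from 10.0.0.7",
    "2024-03-01 09:15:40 GET /index.html 200",
    "2024-03-01 10:02:11 disk usage warning on /var",
]

parsed = [entry for entry in map(parse_line, sample_log) if entry is not None]

# Summarize by hour: the kind of count a chart would be built from.
entries_per_hour = Counter(timestamp.hour for timestamp, _ in parsed)
```

A per-hour count like `entries_per_hour` is the sort of summary that can be turned into a graph, which is usually easier to read than the raw log.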
Not very many people can look through a web log and immediately see what might be wrong with the server. You need some sort of program that turns it into a graph or something pretty so that you can understand what it is. So that's the difference from structured data: with your address book, anybody can look at it and understand it, but with semi-structured data you really have to put some effort into understanding what's in the data. You need some sort of visualization, a chart or something like that, to understand what's in the log.
Now, the third type is unstructured data. Unstructured data is typically called schemaless. We talked about the address book as structured data, where you have a first name, a last name and an address. That is typically called the database schema. You understand what's going to be in the database, so you can label the columns and you know what's going to come in. With unstructured data you don't have this kind of schema. It could be anything. It could be just some random text. It could be an image.
A very good example of unstructured data is social media. If you look at your Twitter feed, you'll see that there are some videos and there's some text, and the text that people use isn't very structured. It could be a question. It could be broken into two parts. It could be filled with emojis. It could be text and then a video, or even audio. So you'll see varying types of data, and they can come in different formats. You could have an MP3 audio file or an M4A audio file or something like that. Or with video, you could have MPEG-2.
It's unlikely, but you could have it, or MPEG-4. It could come in different formats. The text could be formatted in a different way, and sometimes it could even be in different languages. So unstructured data is more difficult, and it's everywhere. With social media and people uploading videos to YouTube all the time, it's only increasing.
Now, one of the challenges with data science is that, by a commonly cited estimate, about 80 percent of the data out there is unstructured. This was the big trend a couple of years ago in big data: there were these massive amounts of new data flooding people's servers, and everyone was trying to figure out what to do with it. Big data is related to data science, because a lot of data science uses big data, massive petabytes of unstructured data, and tries to get some value from it. Something like YouTube is a really good example of big data that's unstructured. You have millions of people uploading videos every day, with audio and things like that, in different formats and with different structure. The big challenge that Google has is figuring out how to categorize it and what to do with it.
So you have these three types of data: structured, semi-structured and unstructured. One of the big challenges for data science teams is figuring out what to do with all three. How are you going to analyze it? Remember, data science is about using the scientific method to understand your data. You're going to run experiments, you're going to look at the data, and you're going to try to figure out how to extract value from it. To do that, you need to analyze it, and one of the ways you analyze it is to put it into a database management system.
So there are different types of databases. Databases came up in the 1960s, a lot of it around the space program, starting with IBM's Information Management System. Later on came the relational model of databases. Most organizations, probably including yours, have a relational database, an RDBMS or relational database management system. Maybe it's MySQL or Microsoft SQL Server or something like that. SQL stands for Structured Query Language, and it's one of the ways you extract data from these databases. It's a language for running reports and joining tables so that you can mix and match your data and extract value from it.
Around 2004, Google came up with something called Bigtable, which was one of the big impetuses for big data, and which was later replicated in open source as HBase. It was a way to put massive amounts of unstructured data into a management system. Then later on you have Cassandra, which you might have in your organization, and NoSQL, which is a way to manage unstructured data in a massive cluster. You can take any data and put it into the cluster. Most of these clusters are working with unstructured data, so you'll have clusters of servers that pull in video and audio and try to manage it.
So these are the different databases, the different ways to manage your data, and you should be familiar with them if you're going to work with data science teams. They go all the way back to 1966, where you have a structured database with IBM's Information Management System. Then you have relational databases that can take different tables, almost like spreadsheets, and link them together and run reports.
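Here is a small sketch of what linking tables and running a report can look like, using Python's built-in `sqlite3` module with an in-memory database. The table names (`people`, `purchases`), the columns, and the sample rows are invented for illustration; a real organization's schema would differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the example
conn.executescript(
    """
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL
    );
    CREATE TABLE purchases (
        id INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL REFERENCES people(id),
        amount REAL NOT NULL
    );
    INSERT INTO people (id, first_name, last_name)
        VALUES (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing');
    INSERT INTO purchases (person_id, amount)
        VALUES (1, 20.0), (1, 5.5), (2, 12.0);
    """
)

# The "report": join the two tables and total the purchases per person.
report = conn.execute(
    """
    SELECT p.first_name, p.last_name, SUM(b.amount) AS total
    FROM people AS p
    JOIN purchases AS b ON b.person_id = p.id
    GROUP BY p.id
    ORDER BY p.id
    """
).fetchall()
conn.close()
```

The `JOIN ... ON` clause is what links the two tables: each purchase row points back to a person through `person_id`.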
And you do that with SQL, or Structured Query Language. Then later on you have Bigtable, which starts to work with unstructured data, and Cassandra and NoSQL.
When you're working with typical databases, some of the oldest ones, you'll have something that looks a lot like this. You'll have a database table, and it's very similar to the address book, where you have a first name, last name and date of birth. You know what to expect. Social Security number, you know that's going to be a number and not a word. Place of birth, you typically know that's going to be text. Date of birth, you typically know the format. So this is very structured data, and it's easy to manage.
What relational database management systems do is take these different spreadsheets or tables and link them together so that you can create reports, and they do this through Structured Query Language. Often what this looks like is a very simple database where you have a series of different tables with structured data, and you link them together as a way to manage your data, to create reports or views, or to look at it and try to understand it. Your data science team will most likely be working with some relational databases and some structured data, and it will look a lot like this.
Data science teams will typically also use NoSQL. So you might have Cassandra, like I said, and these use what are called key-value stores. This came out in 1998. A key-value store is basically a way to create a key, a name for something. You could just call it "video" and then put a value in it, and the value could be anything. These are schemaless. Remember I talked about how databases typically have schemas, like first name and last name.
So you have to know what's going to go into your database before you start collecting data. Imagine a relational database, with all those different tables that are connected together. Let's say that you set up first name and last name and started collecting data, and then you realize, oh my gosh, we have to collect people's middle names. Then you have to update the schema and either go back and try to collect people's middle names, or accept the fact that some of the people in your database are going to have middle names and some are not. That was a big challenge with relational databases.
NoSQL, which is sometimes read as "not only SQL", doesn't have this challenge as much because it's schemaless. It just uses these key-value pairs: there's a key, and there's a value attached to it. So you can expand your data as much as you need to.
If you're collecting massive amounts of data, one of the advantages of NoSQL is that you can continue to add servers. One of the big challenges with relational database management systems is that they don't scale as well for these massive data collections, because, as you saw, you have to connect these different tables together. As you start to collect massive amounts of data, you have a real challenge scaling that out. With NoSQL, you have something called horizontal scaling, where you just throw more servers in and the cluster replicates data across the different servers. So NoSQL typically scales better, and it's more flexible because you just have these key-value stores. You don't have to deal with database schemas like you do with relational databases.
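The middle name problem can be sketched side by side. In the example below, the relational side uses Python's built-in `sqlite3`, and the key-value side uses a plain Python dictionary as a stand-in. The dictionary is only an analogy for the idea of a schemaless store; it is not the API of Cassandra or any other NoSQL product, and the keys and sample values are made up.

```python
import sqlite3

# Relational side: the schema must change before the new data fits.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO people VALUES ('Ada', 'Lovelace')")
conn.execute("ALTER TABLE people ADD COLUMN middle_name TEXT")  # schema update
conn.execute("INSERT INTO people VALUES ('Grace', 'Hopper', 'Brewster')")
rows = conn.execute(
    "SELECT first_name, middle_name, last_name FROM people ORDER BY rowid"
).fetchall()
conn.close()
# Rows collected before the change have no middle name (NULL / None).

# Key-value side: each key maps to whatever value you hand it.
store = {}
store["person:1"] = {"first_name": "Ada", "last_name": "Lovelace"}
store["person:2"] = {
    "first_name": "Grace",
    "middle_name": "Brewster",  # new attribute, no schema change needed
    "last_name": "Hopper",
}
store["video:1"] = b"\x00\x01\x02"  # a value can just as well be raw bytes
```

In the relational table, the earlier row ends up with an empty middle name, which is exactly the trade-off described above. In the dictionary, nothing had to be declared ahead of time.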
NoSQL doesn't work as well for structured data, because if you know what's going to go into your database, there's not really that much need for a NoSQL cluster. But if you're going to collect video, audio and text, or if you want to collect social media streams, then NoSQL is really the way to go. It's very difficult to do that with a relational database, because you have to create a schema, which means you have to know all the data that's going to go into it before you start collecting.
So NoSQL has been a really big thing for data science. It was also a really big thing for big data, because it gives data science teams a lot more room to explore. You can collect a lot more data and you can run some really interesting reports.
One thing people consider a downside is that, since it's such a massive data set, NoSQL clusters often aren't good for quick transactions. You wouldn't want your bank to use something like NoSQL, because it has something called eventual consistency. If your bank used it and you withdrew money, the update wouldn't instantly show in the database. It would eventually become consistent, so it takes a little bit longer. A lot of banks and financial institutions still use relational databases, for a couple of reasons. One is that almost all the data they receive is structured: it's your name, the amount in your account, how much you withdrew and how much you deposited. The other reason is that eventual consistency is a little bit slower, and when you're working with banks you want the transaction to show up immediately.
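Eventual consistency can be pictured with a toy model. The sketch below keeps two copies of an account balance in plain dictionaries and applies a withdrawal to one copy first; the names `primary`, `secondary`, `withdraw` and `replicate` are hypothetical, and real clusters replicate in far more sophisticated ways. It only shows why a read can briefly return stale data.

```python
# Two copies of the same account, held on two imaginary servers.
primary = {"balance": 100}
secondary = {"balance": 100}
pending = []  # updates that have not reached the second copy yet


def withdraw(amount):
    """Apply the withdrawal to one copy and queue it for the other."""
    primary["balance"] -= amount
    pending.append(amount)


def replicate():
    """Deliver queued updates so that both copies agree again."""
    while pending:
        secondary["balance"] -= pending.pop(0)


withdraw(40)
stale_read = secondary["balance"]  # read lands on the copy not yet updated
replicate()
fresh_read = secondary["balance"]  # after replication the copies match
```

Between `withdraw` and `replicate`, a reader who happens to hit the second copy still sees the old balance, which is the delay a bank would not accept.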
So now you've seen the types of data and the different ways that data can be stored in a database. If you're working with a data science team, or on one, you're going to want to be familiar with the different servers and technology that the team is going to be using.
Now let's think a little bit about the science. What's the science part of data science? The main tool you're going to be using for the science on data science teams is statistics. There's an old quote, popularized by Mark Twain, that there are three kinds of lies: lies, damned lies and statistics. What he meant is that people tend to think of statistics as pure math, so a lot of times you can fudge things with statistics and make it look like the truth. You have to watch out for that with your data science team, and we'll see a little bit more of that later on.
The definition of statistics is the science that deals with the collection, analysis and interpretation of numerical data, often using probability theory. You'll see a little bit about probability theory later, and you've already seen a little bit about the collection and analysis of data. Now we're going to think about the interpretation.
We're always using statistics, even if we don't really think about it. My son is in school, so we're always watching his GPA, and that's a statistic. People who are sports fans are always working with statistics. A lot of times statistics give us comfort. My wife doesn't like to fly very much, especially if there's turbulence.
So I try to remind her that statistically this is really the safest way to get where we want to go. It doesn't help, but that's a way to use statistics to think about it.
People use statistics all the time. If it's political season right now in whatever state or country you're in, you'll see lots of statistics, like jobs reports. One politician will say our unemployment went down, and a competing politician will say, yes, but income also went down. So people use statistics to try to tell their version of the truth, and this can sometimes be a dangerous thing.
One of the most common kinds of statistics you'll see, even with your data science team, is descriptive statistics. The two tools you'll see the most are the median and the mean. They're trying to describe large groups of numbers, and that's why it's called descriptive statistics. The median-versus-mean challenge is one you'll see a lot on data science teams. In fact, you can take the same group of numbers and it might mean something different to you based on whether you use the median or the mean.
The mean is what you can think of as the average of the group of numbers. You add them all up and then divide by the number of values you have. That gives you an average, something that generally describes the whole group. A GPA is a really good example of a descriptive statistic: they take all my grades, add them up and divide by the number of grades, so I can get a sense of where my grades fall overall.
Now, the median is the number that's in the middle of a distribution. You'll have five numbers or something like that, lined up in order, and it's the one in the middle. If you're trying to describe those five numbers, you can get a different answer based on whether you use the median or the mean.
Here's a good example. Sometimes I go down to Florida, where my parents have a place, and one of the things they have there is a little country club with a little bar. Let's say there were three stools at the bar and three people sitting on them. One person made fifty thousand dollars a year. Another person made one hundred thousand dollars a year. And the third person made something like a billion dollars a year.
Think about that in terms of the mean. If I added those numbers up and divided by three, because there are three people sitting on the stools, then the mean income of those three people is roughly three hundred and thirty three million dollars. So you would think to yourself, oh my gosh, these three people are incredibly wealthy, all of them make roughly three hundred and thirty three million dollars. But one person is making fifty thousand dollars a year, which is pretty good but wouldn't be considered enormously wealthy, and certainly isn't three hundred and thirty three million dollars. And the person in the middle is making one hundred thousand dollars a year, a good management salary.
But again, three hundred and thirty three million dollars is a lot if you assume that person makes that much, and it's really the third person, the one who makes a billion dollars, who's skewing the results. If you look at this bell-curve-type shape that I'm showing you, you'll see that the person who made a billion dollars skewed the results to the right. Everybody seems much wealthier because you have this billionaire sitting on the third stool.
Now suppose I took these same three people and looked at the median. One person makes fifty thousand dollars, one person makes one hundred thousand dollars, and one person makes a billion dollars. I line them all up, and the median income is one hundred thousand dollars. That, again, is descriptive of the three people. But if I found out that the third person was a billionaire, I'd say, OK, that doesn't really describe them either.
So when you think about the median and the mean on a data science team, you have to think about which one better describes the group of numbers. In this case you have three people whose mean income is three hundred and thirty three million dollars, which seems confusing because one person makes one hundred thousand and one person makes fifty thousand. Is it a better description to talk about the median? Then you have one hundred thousand dollars, but you've got a billionaire in the room who makes way more than that. When you're working with your data science team, you have to work with them to figure out which one is more descriptive and which one tells the better story.
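The three incomes from the bar stool example can be checked with Python's standard `statistics` module. The variable names are just for illustration; the numbers are the ones from the example.

```python
from statistics import mean, median

# Yearly incomes of the three people at the bar, in dollars.
incomes = [50_000, 100_000, 1_000_000_000]

mean_income = mean(incomes)      # add them up, divide by three
median_income = median(incomes)  # the middle value once they are lined up

print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")
```

The mean comes out a little over 333 million dollars, pulled up by the one billionaire, while the median stays at 100,000 dollars. Neither number is wrong; they just describe the group differently.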
You see this a lot with politicians. One politician who gives a massive tax cut to the wealthiest people will say, OK, our mean income went up, now everybody makes three hundred and thirty three million dollars. Then someone who competes against them might say, well, actually, you just gave a huge tax cut to this billionaire, and the median income hasn't moved at all. It's still a hundred thousand dollars. So politicians and other people will play with this kind of descriptive statistic to try to tell a better story. When you're working with a data science team, you have to understand this challenge with the median and the mean.
Another thing you'll probably be working with, remember, from the definition of statistics, is probability. Probability is the likelihood that something will happen, and it ranges from high to low. One of the challenges a lot of people have is understanding that probability is an expression of uncertainty. Again, you see this a lot in politics, especially around polling. If one politician has a 70 percent chance of winning, people think to themselves, oh, well, that politician is going to win. But that's not really what probability is saying. It means that when they did the polling and the modeling, that politician won 70 out of 100 times, but 30 out of 100 times they didn't. So it's entirely possible that they won't win. A lot of people think, well, 70 percent, then it's done, game over. They don't really think, OK, this is just 70 out of a hundred times that this was the outcome.
But 30 out of 100 times this outcome didn't happen. That's a different way to think about it. So when you're working with your data science team and they start to talk to you about probability, don't assume that if it's 80 or 90 percent you don't even have to worry about it, because 80 out of 100 times this happened, but 20 out of 100 times it didn't.
You see this also with hurricanes. They'll say there's a 70 percent probability that the hurricane will make landfall, and people say, oh my gosh, it's certain to hit. Then when it doesn't, or it goes somewhere else, they say, well, you were wrong. But the statistician wasn't wrong, and the data science team wasn't wrong. They were just saying that 70 out of 100 times this happened, but 30 out of a hundred times it didn't.
Again, you see this a lot with polling. People will say, so-and-so had a 56 percent chance of winning, so you said they were going to win, and they didn't. A lot of people have trouble with that. When you work with your data science team, keep in mind that a 56 percent probability doesn't mean it's going to happen. It means that 56 out of 100 times it happened, but 44 out of 100 times it didn't. So you don't really know what the outcome is going to be. It's only marginally more likely that one thing will happen than the other.
So remember not to get confused by probability. I've worked with a lot of teams that have a big challenge with this when they explain probability to other people in the organization. Those people say, well, you said there was a 67 percent chance this was going to happen, so I never thought about it again.
And in response the team says, well, no, I said that 67 out of 100 times it would happen, but the other times it wouldn't. So probability can be tricky, and that's something to think about with your data science team.
Another thing that data science teams work with a lot is correlation. This is when variables correlate with one another. You'll see this with movie recommendations, or when pharmaceutical companies are testing drugs. You have a bunch of data points and you want to see whether they cluster together. There can be what's called a positive correlation, where, as you'll see with the line here, everything is clustered together in pretty much a straight line. And there can also be a negative correlation, where the data points are clustered together but moving in the opposite direction.
Think about movie recommendations. I was on an airplane recently and ended up watching an old movie I'd never seen before called Breakfast at Tiffany's, with Audrey Hepburn. It was good. If I went to Netflix and watched Breakfast at Tiffany's, they would look at the other data points, the other people who watched it as well, see what other movies those people watched, and try to correlate the data points. So if someone like me watches Breakfast at Tiffany's, and someone else who watched Breakfast at Tiffany's also ended up liking a movie called Serendipity, then there's a pretty good chance that I would end up liking Serendipity. So you make that positive correlation.
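The positive and negative correlations described above can be measured with a correlation coefficient. The sketch below computes the Pearson coefficient by hand, which runs from -1 (strong negative) through 0 (no linear relationship) to +1 (strong positive). The ratings are invented for illustration, and `romance_b` and `action_movie` are hypothetical stand-ins; this is not how any particular recommendation service works.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Made-up 1-5 star ratings from the same five viewers.
romance_a = [5, 4, 2, 1, 4]     # ratings for one romantic movie
romance_b = [4, 5, 1, 2, 4]     # ratings for a similar movie
action_movie = [1, 2, 5, 4, 2]  # ratings for a very different movie

similar = pearson(romance_a, romance_b)      # positive: ratings move together
opposite = pearson(romance_a, action_movie)  # negative: ratings move apart
```

With these made-up numbers, viewers who rated the first movie highly also tended to rate the similar movie highly, and tended to rate the very different movie poorly.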
Now, if that person who watched Breakfast at Tiffany's also ended up downloading Armageddon and watching it, but hated it, then you could make a negative correlation between people who liked Breakfast at Tiffany's and people who liked Armageddon. Your data science team will try to create these correlations so that you can make assumptions about people's behavior.
One of the big challenges you'll see with data science teams, similar to probability, is that correlation doesn't necessarily imply causation. I remember reading an example about an organization that was trying to figure out how to get people to be healthier at work. They had these little IoT devices, Internet of Things devices, that could track people in the office to see whether, if they got up and moved around more, their health care costs would be lower. What they found is that people who got up every hour or so and went outside for five minutes ended up having much higher health costs.
That seems like a counterintuitive result. You would assume that someone who moves around a little more and goes outside would have a healthier outcome. So they looked at the data and said, OK, I guess people shouldn't move as much, because moving and going outside for five minutes caused their health care costs to go up. But if you've worked in an office, you may recognize this: a lot of the people who get up every hour and go outside for five minutes are actually smokers. If you're getting up, going out, smoking and coming back, the smoking is offsetting the health benefits of going outside for five minutes.
So getting up and going outside for five minutes was correlated with higher health care costs, but it wasn't the cause. The cause was that they were smoking.
You see that a lot. There was a huge health care crisis a couple of years ago where people thought that taking estrogen shots was beneficial to women's health, because they saw a correlation between estrogen shots and better health outcomes for women. So women were taking these estrogen shots, and they actually turned out to be very unhealthy. The real explanation for the correlation was that the women who got estrogen shots were more likely to have higher incomes, and that's why they were better off. The estrogen shot itself was actually harmful.
You see correlation and causation confused all the time. A lot of people say there's a correlation between the temperature going up and crime rates going up. The cause might be that when people get hotter they're more inclined to commit crime, but it's probably more likely that since it's warmer, more people are outside. So you have to watch out for making what are called spurious causation arguments, where you attach the wrong cause to a correlation. When you're working with your data science team, watch out for this correlation and causation challenge.
Now, data science teams work a lot with something called predictive analytics. This combines a lot of the statistical techniques we've already talked about, probability, correlation and, remember, descriptive statistics, and tries to predict what people will do in the future. Obviously, this is of a lot of value to many large organizations.
If you can predict when your customer is going to buy your product, you can have it there on time. If you can predict when your customer is more likely to buy, you can focus more on advertising. So a lot of the emphasis in the last few years with data science teams has been on predictive analytics, trying to use these different statistical techniques to predict people's future behavior so that you can meet them where they're going to be. So if you're on a data science team, you should be familiar with this term, predictive analytics. And you should think of it as these different statistical techniques combined together to try to predict customers' future behavior, or anyone's future behavior. If you can predict the weather, you might be able to predict whether or not people are going to visit your website. Or, going back to correlation, if you could predict when people are more likely to watch a certain movie, then you can buy some licensing for that movie so that you make sure people can watch it without any problems. So predictive analytics is a big thing, and it's been big with data science teams. You should be familiar with that term.
Finally, I want to go a little bit over regression analysis. You'll see this also with data science teams. This is when you have two variables and you try to see if there's a relationship between those variables. One of the classic examples is that taller people tend to be heavier. So those two variables, the height and the weight, are closely connected. And you can see a regression there with the straight line. You'll see these kinds of regression charts, and people try to do regression analysis all the time.
When you're working on a data science team, you might try to make a connection between how much a customer spends and how much time they spend on your website, and try to make a clear regression. I'm not going to go too deep into regression because it can get a little tricky. But I want you at least to be familiar with what regression analysis is: the connection between two variables, a dependent and an independent variable, and trying to show the trend line between them. So if you looked at this regression chart, you would assume that as someone gets taller, they would also tend to be heavier, they would weigh more. And then you can try to figure out how close each point is to the trend line there. That's what that line going up is.
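As a rough sketch of the height and weight example, the Python below fits a straight trend line by ordinary least squares and then measures how closely the points follow it. The five data points are invented for illustration, and the helper names (`fit_line`, `r_squared`) are hypothetical. Height is treated as the independent variable and weight as the dependent one.

```python
# Made-up measurements, for illustration only.
heights_cm = [150, 160, 170, 180, 190]   # independent variable
weights_kg = [52, 58, 67, 75, 83]        # dependent variable


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variation in y explained by the line (1.0 is a perfect fit)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(heights_cm, weights_kg)
fit = r_squared(heights_cm, weights_kg, slope, intercept)
predicted = slope * 175 + intercept   # trend-line estimate for someone 175 cm tall

print(f"weight = {slope:.2f} * height + {intercept:.1f}")
print(f"R squared: {fit:.3f}")
print(f"estimated weight at 175 cm: {predicted:.1f} kg")
```

A positive slope is the line going up. An R squared close to 1 means the points sit close to the trend line.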

The final thing I want to go over before we start talking a little bit about teams is the difference between samples and populations. The big push around big data has been around being able to work with massive sample sizes. When you work on a data science team in an organization, a lot of times you will have a limited set of data. You might not have all the data on what movies people are watching, or you might not have all the data about what their income is. And so you can take a sample of that data, find a small group of people that you think is going to be representative of the larger group, and do some analysis based on that sample. Now, one of the big things about big data is that as you're able to collect these massive data sets, you can start to work with whole populations. So think about polling data when there's an election coming up. Typically, people don't go out and talk to every voter. What they'll do is find a sample that's representative of how they think the voters in a certain city or district or state will vote. I'm here in Atlanta, Georgia. So if there's an election for mayor or something like that, they don't poll everybody in the city. What they'll do is find a thousand people, five thousand people, fifty thousand people in the city, and they'll use that as a sample, do some statistical analysis on them, and then make generalizations about the entire city. Now with big data, you're able to collect massive amounts of data, and you can sometimes work with entire populations. So if you're working with social media or something like that and you're taking in data from Facebook, unstructured data from Facebook, then you might be able to deal with a million people or two million people, or just the entire population of a group.
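The polling example can be sketched in a few lines of Python. Everything here is simulated: the population size, the 52% level of support, and the sample size of 1,000 are assumptions chosen for illustration, not real polling figures. The sketch estimates a candidate's support from a random sample, compares it with the share in the whole simulated population, and computes the usual approximate 95% margin of error for a proportion.

```python
import random

random.seed(42)

# Hypothetical population: 200,000 voters, each supporting the candidate
# with probability 0.52 (an invented figure).
population = [random.random() < 0.52 for _ in range(200_000)]
population_share = sum(population) / len(population)

# Poll a random sample of 1,000 voters instead of asking everyone.
sample = random.sample(population, 1000)
sample_share = sum(sample) / len(sample)

# Approximate 95% margin of error for a sample proportion.
margin_of_error = 1.96 * (sample_share * (1 - sample_share) / len(sample)) ** 0.5

print(f"whole population: {population_share:.3f}")
print(f"sample of {len(sample)}: {sample_share:.3f} +/- {margin_of_error:.3f}")
```

With the whole population in hand there is nothing to estimate, which is the appeal of big data. With a sample you get an estimate plus a margin of error that shrinks as the sample grows.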
So you should think about that when you're working with a data science team: the difference between samples, where you take a small group, and populations, where you're working with an entire group and doing statistical analysis.
OK, so we've talked a little bit about data, we've talked a little bit about how you can organize and work with that data in databases, and then we talked a little bit about statistics as a way to analyze that data. So now I want you to think a little bit about how you can change your organization to take advantage of this data and this analysis, so that you can do the science in data science and run these experiments. Now, a lot of organizations aren't structured very well for the scientific aspect of data science: running these small tests and experiments, having these teams. And so you should start to change your organization to take advantage of the empirical, questioning approach that you need in order to use the scientific method and run these experiments on your data. One of the best ways I've seen to do that is to start having these question-only meetings about your data, where you're sitting around in a meeting and you're trying to ask questions about your data. No one in the meeting should make any statements. You should think about the questions that you want to ask about your data. Don't try to answer the questions in the meeting. Just look for interesting questions. So if you have a massive amount of data about people's credit card transactions, then maybe ask an interesting question like, well, what can we learn about our customers through these credit card transactions? Can we learn if they're happy? Can we learn if they're going to spend more? Can we learn if they're going to use a certain promotion?
So if you ask questions like that in these question-only meetings, it'll help you. It's almost like forming a hypothesis when you're using the scientific method, where you're asking questions and trying to do something interesting. So the first thing you should do at your organization is start having these question-only meetings. Now, one of the best ways to take these questions and deal with them in your organization is to have something called a question board. This is where you organize your questions so that people feel that the questions they ask in these meetings are going to get results. So you get these questions, you put them up on a board, and then you organize them into different groups. And then you let the data analysts from your data science teams pull these questions off the board and see if they can run some experiments off them, get some results from them, do something interesting with them. So this is a really good way to organize your questions, and it's also a really good way to have people feel like when they're asking questions in a question-only meeting, it's going to lead to something. They're not just asking questions for the sake of asking questions. Now, when people are asking these questions and putting them on the board, or when your data science team is pulling them off the board, try not to think about the data. Don't worry about whether you have the data to answer the questions, and try to stay away from yes-or-no questions. You just want to focus on something interesting. Try to think about second-order questions. Instead of thinking about why are customers spending less in July, think about why would a customer spend less.
Think about these bigger questions: what do we know about how a customer spends, or why a customer spends? These are bigger questions that will let your data science team run some small experiments and try to come up with some interesting results. There's an old joke in statistics that you shouldn't use statistics the same way that someone who's drunk uses a lamppost. You shouldn't use statistics for support of something you already believe. Instead, you should use it for illumination, to find something new. So question-only meetings help you ask these questions, and they help your organization, your team, figure out what they don't know. And figuring out what you don't know is one of the most productive ways to have your data science team run experiments and do something interesting. Now, for your data science team, what I've seen work best is a small team based a lot on the scientific method. Instead of going out and trying to find a data scientist and paying two hundred or eight hundred thousand dollars a year for a kind of unicorn data scientist, you should create a small team that uses the scientific method, where they're inventing and asking interesting questions. What I've seen work well is a data analyst, someone just out of grad school who knows statistics, knows how to run reports, knows how to work with the data. Then a role that I call, and some organizations call, a knowledge explorer, which is someone who runs the question meetings and is really interested in getting good questions.
The data analyst is really interested in using the scientific method to ask something interesting. And then there's a servant leader role, which is someone who's responsible for getting the results of the experiments out to the rest of the organization and for doing some more practical things, like making sure that the data analyst has access to database servers and can migrate data from one server to the next, that type of thing. These very small teams, I've found, have done a really good job at asking interesting questions and doing interesting things with the data. And these small teams, I've found, work better than trying to put all of your money into one person, spending two or eight hundred thousand dollars a year on a really high-end data scientist who could try to do all these things. Instead, have a small team of three people and have them do interesting things with the data.
So, OK, we're pretty much out of time now. I hope you enjoyed this webinar, and I want you to have these five takeaways from the time we spent together. The first is that your data science will depend on the types of data that you're collecting. So you should understand the different types: structured, semi-structured, and unstructured data. There are many different ways to manage your data. Remember, we talked about the relational database management systems, the old IBM systems, the NoSQL clusters, Cassandra. So there are different ways to manage your data, and you should be familiar with all of them and the pluses and minuses of each. You can use statistics to explore your data. That's really key. Remember descriptive statistics, probability, correlation, things like that. Remember that correlation doesn't imply causation, and use statistics to explore your data.
And then you should change your organizational structure to make guesses and run experiments about your data so that you can get more out of it. You want to change your organization so that instead of focusing on things you know and proving things you know, you're more scientific, more exploratory, and you make guesses and run experiments. And finally, I want you to think about creating these small teams of people that can get your organization comfortable with asking questions. A lot of organizations are very uncomfortable asking questions. It's a change, and changing the structure is one of the key parts of having a successful data science team. So I hope you enjoyed this presentation, and good luck structuring your data science team and finding out interesting things about your data.
