EXCEL: ANALYSING AND MANAGING DATA
This course is based on the Excel course in Linkedln by Oz du Soleil.
You’ve just gotten some data. Now what? That’s what this course is about, the now what. In this course, you’ll see assessing data quality before you dive into analysis and start generating reports.
The importance of checking data quality
Let’s talk about the importance of checking data quality. So, let’s say you finally got the data and you’re ready to start doing some analysis and passing it off to colleagues and bosses. You are so excited, and you just learned pivot tables and how to unpivot in power query, and I am going to say, stop, look. Data can fall anywhere between trustworthy and worthless.
Too many people assume that the data that they get is trustworthy and they dive in, start doing analysis and inspiring decisions, but they haven’t assessed where they are on this spectrum. So, the first question you should ask yourself is, can I trust this data? Why? Because there are people who are impacted by our data management and data analysis.
Now, let’s look at a few things that corrupt data quality. Here, we have a list of cities.
We have inconsistencies. There’s NYC, there’s New York NY, NY, and if you did analysis on that, this can cause problems for you because you’ll have data tied to NYC, data tied to New York and data tied to NY, NY. You got to check this kind of thing out, the inconsistencies. Look at the guests. We have Rick McCray, Celia Silva and James Fielder’s name is inconsistent with the others. Oh, and we have two Celia Silvas and a family of S. Archer, and then just plain Bob. This has got to be handled. Now, column F, we have name, birthday, and occupation all in the same cells. That would need to be split out. You might want to sort by the birthdays, or maybe you don’t want the birthdays at all and need to dig ’em out and then delete ’em. You’ve got to be aware of this and clean up what you can. So, you have to be really clear about what you’re trying to do, what you need, if you’ve decided that the data is clean enough and complete enough, then yes, go ahead. Go forward with your analysis, but always think data quality. Where is it on the spectrum? Is it trustworthy or is it worthless?
Use the UNIQUE function to investigate data and data quality
Here is some data about people who studied and then took an exam.
We’ve got Moises who took Course A and passed. The instructor was Nirel, and the test location was Eighth Avenue, and this was not a retake for Moises. But now look at Jody in the fifth row. There’s a lot that we can get from this data if we can trust it. We could look at, do people prefer any test location over others? Is there any correlation between people who fail the exam and the instructor who prepared them for the exam? But before we start running period tables and writing formulas, we need to see how trustworthy this data is. Now, I’m going to go to Filter, and then I’m going to go to Pass Fail.
Okay so, we’ve got a 96, where we should only have a pass fail. And this is an example of where we really need to understand our data. So in this situation, 96 does not make sense. That’s clearly a pass. So I’m going to unselect Fail, unselect Pass, okay? Filter for this 96, and then I’m going to change it, to a P for pass.
Okay. The 96 is gone. We can clear this filter. But now let’s look at the instructor. There’s a lot to clean up here.
We got K space C, K period space C period, KC. Let’s use the unique function equals unique open parentheses.
This is the array that we want to look at. Enter. Okay, now we can see KC is an issue, then that, and then K space C. So I’m going to go through KC and notice that unique is dynamic. It changes with the updates that I make. These are the names of the instructors, they’re all clean now.
Let’s look at the test locations I’m going to highlight in that formula bar and then drag this over. Now we’re looking at the test locations, enter, okay.
And now notice there is Eighth Ave and Eighth Avenue. We’ve got to pick one or the other, we can’t have both. So one thing that I can do, I’m going to highlight, and then in the Home tab, go to Find Select, and then Replace.
For here, Find what, Eighth Avenue. Replace with, Eighth Ave, and then Replace All. Close. And you notice Eighth Avenue is gone. And then there would be Polk Street Towers, Polk Street, and then this misspelling of Polk Street. And then Polk with street fully spelled out.
So these are the types of things that we have to look for before we do any type of analysis with our data. And the unique function can help assess how trustworthy the data is and help monitor our progress in getting the data cleaned up so that it is trustworthy.
Check data quality with a PivotTable
Pivot tables are great for doing analysis, but I’m going to show you how we can use a pivot table to check our data quality.
We have this data that’s come in and we want to analyze it various different ways, we want to check each rep’s sale amount, we’d like to compare product lines, we’d like to look at the reps vis-a-vis accounts. Check the data quality. Now I’m going to go to Insert and then Pivot Table. Here’s the pivot table wizard.
Let’s look at the rep. And then let’s look at the product lines with that here and then sales amounts. Okay. So now we see a problem, there’s Norm and Norman.
In this situation, they are the same person. Got to clean that up. It’d be an awful situation if Norman missed out on a bonus because of some sales that were tied to Norm. Let’s go back to Sheet 1 and I am going to open the filter buttons, go to Data and Filter, and I’m going to select Norm, and oh, I also see Jean-Pierre and JP. JP is Jean-Pierre so I need to check that.
Okay, now let’s get these fixed. Jean-Pierre and then Norman and that’s Jean-Pierre as well. Clear the filter. Go back to the pivot table. All right. Nothing’s changed because we haven’t refreshed yet. Still in the Data tab, I’m going to go up to Refresh All. All right. No more JP, no more Norm.
Notice we’ve got Home Decor and then Home Decor. One has an accent over the E and the other one doesn’t, that needs to be cleaned up. That’s all got to be cleaned up.
Next, let’s get rid of Product Line here and get rid of Rep and then go to Account. And then I’m going to put Account ID underneath. Now, scroll over, okay. We’ve got CCL and then two account numbers underneath that D, space, S, space, A. And then DSA, inconsistent.
So you really need to understand what you need to get done what you are trying to accomplish. And that’ll determine how clean is clean enough for you to move forward with your analysis.