Sunday, June 3, 2012

Rebuild, Reuse, Recycle

As part of an upcoming workshop, I've been rebuilding a spreadsheet developed by someone else to use as a communication tool. The good news is that I was handed most of the necessary formulas. The bad news was that I needed to completely start from scratch with the design, and I still had a lot to learn along the way. This spreadsheet was like Steve Austin, even if it wasn't a $6M spreadsheet.



Here was my to do list:
  • How do you get Excel to display ordinal numbers?
  • Can you make something that looks like a number line...and add dynamic data to it?
  • How do you display a normal curve...and add dynamic data to it?
Fortunately, between Google, YouTube, and Twitter, I was able to make short work of this list. Here is how everything is coming together. This is very much a work in progress---so if you have ideas to help make this awesome, I'd love to hear them.

The goal here is to allow a user to input a value for Effect Size and see how that relates to a variety of other measures: an Improvement Index, the standard deviation, and variance.


Below the space for input, I have set up the space for the other measures.

Before we talk nuts and bolts, I want to talk a little about design here. This is a tool that will be handed off for others to use---and has to enable them to communicate with their own stakeholders. Whatever gets placed in the workbook has to be both rich in meaning and self-explanatory. I want a very simple layout and colour scheme to help direct the eye. I also want to be sure that even if someone is not comfortable with statistics that they could still gain some insight from the graphics. The baseline information in the sheet remains grey; values turn green. This is what a user would see if the effect size is .4:


So far, I like it. I'm wondering about adding some conditional formatting to indicate the strength of an effect. In other words, is an r value of .2 "good"? It may be too much to put here, but it would be great to have a way to help people compare or evaluate what they see. For example, in the education world, an effect size of .4 is the average. Anything below that still represents something effective, but if you're looking to make a big impact in the classroom, perhaps that particular strategy might not be the best. Is this important to add? What about other explanations/resources?



Okay, so let's take a bionic jump into the nuts and bolts and the questions I had above. The first one, about ordinal numbers, came from developing the text shown on the right. The sentence you see is based on a formula: some text in quotes, the "&" symbol, and cells with the dynamic values (in this example, 66% and 66th). But Excel has no function for displaying ordinal versions of numbers. So, how do you tell Excel when to use st, nd, rd, or th at the end of a number?

According to Teh Googles, you use this formula =A1&MID("thstndrdth",MIN(9,2*RIGHT(A1)*(MOD(A1-11,100)>2)+1),2), where "A1" equals the cell with the number you want to add the ordinal ranking to. I grabbed the formula from here, and the post also includes an awesome explanation of why it works. In my spreadsheet, I used three cells for this. The first was for the actual calculation using the NORM.DIST function (=NORM.DIST(EffectSize,0,1,TRUE). Even though I can choose to display the results of that cell using two digits, Excel is sneaky and remembers the whole string. So, as an intermediate step, I had Excel take the contents of the cell and change it into text =TEXT((D2*100),"0"). Now it can't use a long trailing decimal. Take that! I used the contents of this cell for the ordinal formula.

A magic button to insert a normal distribution is missing, too. Are we or are we not living in the 21st century? No flying cars. No "normal" chart in Excel. WTH? My kludge, in this case, was taken from a YouTube video and workbook that I grabbed information from ExcelIsFun. I copied values from the workbook and created a basic area graph. For the dynamic green region, I used the following formula =IF(H3<=Variance,I3,""), where "H3" was the cell with the value for the x-axis of the graph (I had to make all values positive) and I3 had the value I copied from the workbook---the one matching the grey graph. This creates a very narrow range of data points around 0 standard deviations, which are then added to the graph.

The answer for the final piece, the line-type graph for variance, came from a Twitter shout-out.
https://twitter.com/science_goddess/status/208628563790925826
I drew what I wanted, attached it to the tweet, and crossed my fingers for brilliant ideas to head my way. Here are the two I received:

https://twitter.com/BlaskEric/status/208635776714543104
https://twitter.com/Jon_Peltier/status/208748764951875585
I ended up deciding to just have the scale go from 0 to +1. There are negative correlations, but with the formula I was using from the original spreadsheet, there was no possibility of ending up with a negative result.

This has been a great small project to work on. It's also been a reminder that you don't have to know everything about Excel to get where you want to go. There's lots of help available from those who have previously solved problems. Makes it easy to rebuild, reuse, recycle the worksheets you have.

We can rebuild our workbooks. We have the technology. We have the capability to make the world's best spreadsheet. Our workbook will have that spreadsheet. Better than it was before. Better...stronger...faster...



Bonus Round
If you want to play around with the workbook, you can download it here. The final version will be made available at a workshop in two weeks, so if you have any feedback or improvements to suggest, get on it.

Sounds from the Bionic Woman and Six Million Dollar Man from here.

Wednesday, May 30, 2012

Housekeeping

In a fit of spring cleaning, I've reorganized some content for this blog. If you're viewing this post via RSS, you won't notice a thing. For visitors to the site, you will notice some updates.

Pages


I have created three new pages, all with their own permalinks. Depending upon the content of future posts (stats? add-ins?), I may add more pages. I do plan to take advantage of the more dynamic nature of the pages to update content.

All of the posts about building and using a gradebook in Excel are now in one place. Sure, you can use the gradebook tag on any of those posts to see all of them, but the new page houses things chronologically and with some additional text to help guide users. I've also moved the various resources off the sidebar and built a page just for Books and Links.  Finally, I've deleted the blogger profile info from the sidebar and built an About page with a statement about the blog, my contact info, and some backstory.


Blogroll

I've refreshed the blogroll with some great new reads. I hope that you'll check them out:
  • chartsnthings is the (personal) blog of data sketches from the New York Times graphics department. I love the metacognitive aspect of this blog---a peek into the thinking of designers as they build visualizations.
  • "Data Remixed is a blog dedicated to exploring data and sharing insights in an engaging way."
  • Visit Tableau's Viz of the Day for a variety of visuals built by users like you. 

Subscriptions

I've added to the choices you have for getting information from and about this blog. You can still use RSS, but now you can use the buttons on the sidebar to follow me on Twitter, add this blog to your Facebook feed, or add my YouTube channel. Just click and go!

If you have additional suggestions to make the site easier to navigate or links/tools/resources that should be included, let me know.

Sunday, May 27, 2012

Statistically Speaking with Excel: Basic Descriptive Stats

If you want to do some heavy statistical analysis, there are better programs than Excel for doing so. But, let's say that you aren't into #bigdata and just want to do some small scale divination. Excel can totally be your bff for that.

Educators are most familiar with descriptive statistics: the ways we describe a set of data. What is the shape of the distribution of values? Where is the "middle"? What can we say about the population of values and their relationship to one another?

I have pulled a new data set to play with...one with test scores for about 500 students. While you can do statistical analysis with your gradebook---and let's face it, most teachers do in terms of how they assign final grades for students---I am a little squeamish about doing so. There may be no magic number that applies to every sample size, but I want us to play with something we can have a little more confidence in than the basic gradebook I've posted here. You can download the test scores workbook here.

This is the basic set up:


I have divided the data set into two schools: A and B. Each student is identified by their first name, and in some cases, a last initial. There is a raw score for reading, writing, and math, as well as a level of performance (1, 2, 3, 4). The range of score points for each level plays out like this:

You may look at this and wonder about what appears to be missing points (Why can't anyone get a 399 on the reading or math test?!) and why writing has a different scale. There are reasons...good ones...but I'm not going to get into them at the moment. These data do come from (old) state test results: first names of students are unchanged, but I did fill in some missing scores. We're going to play along with the rules that were originally applied to determine the scores. So, do your best to overlook the oddities of these scales for now. For these tests, a score in Level 3 would mean a student can meet the standard (Level 4 = exceeds, Levels 1 and 2 = below standard).

Let's start with measures of central tendency (mean, median, and mode) for our reading data (C2: C517). For the mean, Excel uses the AVERAGE function...for median, oddly enough, we can use MEDIAN...and for the mode, it depends on which version of Excel you have. In olden versions, the MODE function worked just fine, but starting with 2010, you have choices. To just get "the" mode, use the MODE.SINGL function. To get multiple modes from an array of data, you can use the MODE.MULT function---a very handy improvement. Not every data set is bell-shaped. So, here is what we have for the reading scores. I placed the formulas in the table on the right so you can see the syntax.

What does this mean? Well, first of all, our mean, median, and mode are all about the same. It's not necessary for your measures of central tendency to agree. After all, each one is a different way to identify the "middle." It's up to you to determine which is most appropriate. However, in this example, no matter which one you choose, you'll be fine.

If you want to graph this data set, it's not so friendly in its current form. We'd be better off building a frequency table first. This will allow us to find out how many students are in each category, then create a graph to visualize the distribution. To keep things simple, let's find out how many students scored in each level for the reading test. Build a simple table first, then select the empty cells:

Now you're ready to add your formula---a single formula ("array formula") which will fill all the cells at once. Then belly on up to the formula bar (too bad you can't pull a beer from here) and start entering the formula:

The "data_array" will be the column of data (not including the header) with the data about reading levels. The "bins_array" refers to the cells that have the labels for the reading levels---in this example, the 1, 2, 3, 4 you see in to the left of the cells selected in the table above. The final formula looks like this (plus a close parenthesis at the end):


When you've finished entering this information, you need to use a command to fill in all of the cells in the table. Plain old ENTER will not work. You have to use CONTROL + SHIFT + ENTER. And poof! We now have a frequency table:


If this scares you, you can use a simple COUNT function in each cell, but hey, you're ready to use big kid functions. Give FREQUENCY a try. Use the filters in Excel and compare School A with School B.

This is a good spot to stop for today, but we'll come back to this data set another day to see other descriptive statistics in Excel...and then move on to inferential analysis.

Bonus Round
Did you know that Excel has a whole category of statistics functions? Get in there and play!

Monday, May 21, 2012

Using Add-ins: MapCite

Out of the box, Excel is awesome all by itself. And yet there are those who endeavour to kick things up a notch by inventing add-ins...a kind of extreme macro. This specialized code enables Excel to do all sorts of new tricks. In previous posts, we've looked at Sparklines and ASAP Utilities.

I was recently pointed toward MapCite and made some time to give it a try. With MapCite, you can visualize your data using a Bing map embedded in Excel. The add-in looks something like this*:

*See the Bonus Round at the bottom of this post for more information.

For my data set, I pulled from one we've seen here before, the 5th grade state science test results from 2011. I added district address information.


Now, we're ready to "Geocode Data." Select the data---not the columns, or MapCite will tell you it can't handle that much work---for the addresses. Then click the "Geocode Data" button.
 

 In the pop-up window, fill in the information. Then, click "Geocode."

 Holy cow! We've just generated a whole new set of data:

There's more where that came from...
Now, you can add the mapping features. Select the data in the Latitude and Longitude columns, then click "Add Data."

In the pop-up window, indicate the required information and click "Finish."

Let's have a look at what we have wrought. Click on the "Show/Hide Map Pane" button. Here is our first view:


Not too exotic, but that's because MapCite automatically clusters the pins so things don't look messy.  Here's an unclustered look. Note that when you click on one of the pins, the row in the worksheet with the matching data is automatically highlighted. Keep in mind that you can also change the base map that is used.



You can also use the HeatMap feature to take a look:


You may be wondering why I bothered using science data when we just used the addresses of all the districts. It is a bit of a head-scratcher. But, with filtering tools in Excel, you can choose which groups to look at: small schools, those which scored above the state average, etc. You can also add GPX data (data that shows a route that was followed).

What I like about the add-in is that it's easy to use and that I can see the map in my spreadsheet. I don't have to upload my data elsewhere and pull it into another application. However, at this point, MapCite is fairly limited in features. You can make different pins for different pieces of a data set, but you can't show more than one set at a time. It's great to have things on a map, but I need to derive more meaning than just concentration. As such, classroom applications (other than what students might look at it) are limited. I think new features will come in time. I've been promised an upgrade to a Pro account when it's ready for release, and will let you know what else you can do.

Have a look around the MapCite website or YouTube channel for additional information. Better yet, give it a try for yourself.

Bonus Round
I say "like this," because I couldn't get the add-in to install properly. Although MapCite was very responsive to my inquiry for tech support, we couldn't figure out why Excel was making the add-in invisible. I ended up kludging things by creating a new tab for the ribbon called "MapCite," then dragging the groups from the tab-that-refused-to-show into my kludged one. Tech support said that they haven't had any issues similar to this one, so don't let my experience put you off. Also, if you have any clue why Excel shows the MapCite add-in as being active, but won't put it on the ribbon, I'd love to hear how to fix this.


Saturday, May 19, 2012

Give Me That Old Time Religion

We are not new acolytes in the service of charts and graphs. Some of us may worship at the altar of Excel, but that does not mean others have not had their own methodology and purpose.

It is surprising how many schoolmen avoid charts and graphs as though their use required calculus and the practice of the black art. Successful practice of school administration is no different from the practice of business; both require the constant habit of visualization. Many problems of school administration involve the clear and concise presentation of facts and relationships. The stump speaker to the contrary notwithstanding, facts do not speak for themselves. Relationships have to be made vivid to be understood. Unless facts and relationships are presented in a manner at once clear and interesting, they are as useless as locks without keys. Charts and graphs present facts and relationships more forcefully, in less time, and require no more space than the same facts presented in words.
---Bowman, E.L. 1921. Graphic aids in school administration, article I: visualizing organization facts. American School Board Journal, 63(12), 29 - 31. 

Last month, I introduced you to E.L. Bowman---enthusiastic proponent of data visualization as watchtower for schools in the 1920's. I have two more of his articles to share. I like these little historical snapshots. Technology may have been a limiting factor in getting these ideas to take root, but there was still a lot of deep thinking happening about the potential of using data for educational purposes.

In the first article, Visualizing Organization Facts, Bowman extols the virtues of organizational charts like this one:


He also looks at flow charts for coursework and job responsibilities. Bowman includes rules for drawing and duplicating charts, while reminding us that "If [a superintendent] causes these plans to appear in the form of graphs, he makes easier the comprehension of his ideas by others." Mind you, he also likes the idea of using these charts to remind others who's the boss. (Not so different from today.) While I have rarely seen org charts used by schools and districts, I do think that visually presenting processes could be a very powerful tool for schools. What are the pathways for graduation? What interventions are available for students who need additional support for their learning? What are the next steps with behavioral issues? How could graphics help communicate with non-English speaking families about navigating school with their children?

The last article that I have (so far) is about Establishing Routine through Graphs. Here's my favourite quote from this one:
There is scarcely a state school system in the country ·that has not in the past called for unnecessary statistics in the compilation of its annual reports from subordinate districts. Much of the data thus demanded on pain of forfeiture of state subsidy was never used, but was embalmed somewhere in a bulky printed volume.

Remember, this was 80 years before NCLB. I'm telling you, the dude was ahead of his time. Anyway, in this article, Bowman focuses on the procedures schools use to make purchases, track time, and develop lessons.

In most cases, I feel like these visualizations are cumbersome and unnecessary. I'd hope that most of these "routines" are streamlined today, either due to better prepared workers or technological advances. I was amused at the end of the article to read Bowman describe the awesomeness that is having a calendar on your desk. Word.

If my forays into Google Books reveal any more of the long lost gospels of data viz in schools, I'll be sure to post them here. For now, go in peace.

Tuesday, May 15, 2012

Animated Graphs in PowerPoint

Charts and graphs are used to tell a story to an audience. And as with any good tale, you may want to parcel out one chapter at a time. There are lots of tools to help you get the job done, but today we're just going to look at moving your graphs into PowerPoint.

Last year, one of the presentations I was giving was about using cell phones in the classroom. At the beginning of the presentation, I was sharing some data about how students access the Internet. I had pulled data from two different Pew Internet studies. One had data comparing access to computers and cell phones, based on family income. The other had data reporting how teens used cell phones to access the Internet, also based on family income. There is a bit of a surprise in the data, so although I wanted to show the big picture, I wanted to portion it out so people would get the "A-ha!".

The easiest way to do this is to build three different slides: one for the first data set, one for the second, and one for both. But why do it the easy way? Instead, you can use some simple animation to reveal each piece of the puzzle.

Start by building the graph. I prefer to do this in Excel and then copy it into PowerPoint. Here is the full meal deal slide:


Now, I'm ready to add animation. At the beginning, I don't want the blue bars to show. The orange and green show what you might expect. As household income increases, so does access to a computer. Access to a cell phone is similar, but not as dramatic. To make the orange and green bars show first, select them and go to the animation tab. Then, select "By Series" under "Effect Options." I don't recommend fancy animations. Just because you can add spins and flashes doesn't mean you should. A simple fade is plenty.


This will make the first view people have of the slide look like this:



Then, "Add Animation" for the next series (the blue one) to appear...and for the orange series to disappear. Ah, now we can see something interesting: Students from lower income families use their phones the most for going online. (PS---don't you love how the green bars don't look like they're the same shade as the ones above? They're the exact same, but it's a good example of how colour is influenced by context.)



Finally, bring the orange series back for the final comparison.

If you want to see the finished product, you can use the embedded view below. The vagaries of SkyDrive mean that you will need to choose the full screen option to make the animations work. Once it opens, just use your right arrow or page down key to view the results.

Tuesday, May 8, 2012

It's What You Do With It


https://twitter.com/#!/science_goddess/status/198786771260162049

If I asked you how big your data set was, would you hold it against me?

Most days, I feel like what's happening in the world of data is a lot like biological evolution. Outside pressures can drive internal changes. I see a Cambrian-levels of explosive growth in data use and modeling today as we experiment with what outside tools can do to internal data sets. And "Big Data" is rapidly becoming the king beast on the block. Even education cannot escape this growth, with federally funded longitudinal data systems and the other ways districts and states are putting together their data.

https://twitter.com/#!/gnat/status/191606855666118658
But this blog is about small data. I want individual teachers and principals to be able to answer their own questions---without fancy tools, mind-bending formulas/programming, or worrisome charts. Small data has its challenges, however. Statistically, we are on shaky ground. We haven't kicked it around much here, but I do wonder about the validity of the instructional decisions we make based on an assessment of a class of 20 - 30 students. I realize that classroom-level data serves a very different purpose than #BigData. As teachers, we try to support each student. But what represents a reasonable "sample size" for making these determinations? What do validity and reliability look like within the microcosm of small data?

In spite of these (and other) questions, I believe that small data has a significant role within the data ecosystem. We just need to help more people see what that is. I worry that #BigData is on a path to be too large and in charge of what happens in a classroom between an individual teacher and a student. We need to remember that it isn't the size of the data set that matters, it's what you do with it.

Bonus Round
If you live in the Seattle area and want to talk about all sizes and shapes of data, I hope you will join the Seattle Data Visualization Group. The next meetup will be Tuesday, May 22, venue TBD (but likely close to the Tableau folks). Don't worry about the size of your data set---you'll find lots of friendly people who have interesting projects and ideas to share. There were only two of us ed folks there last time, and being able to share our challenges and applications made for great conversation. I'd love to see you there!