3

Step away from the spreadsheet – common errors in using spreadsheets, and their ramifications

This chapter outlines exactly why many librarians struggle to produce actionable data. Many of the problems with data can be traced back to the data structure, which in turn can be traced back to the way the production of data is managed by a library. Even librarians who should know better commonly use a crosstab-type data structure for entering their raw data. This chapter explains, with library-specific examples, how profoundly limiting these commonly used raw data structures are, and outlines the emotional decisions that underpin the use of such a structure. The chapter finishes by outlining ten common pitfalls in spreadsheet management, and how to avoid them.

Keywords

Common pitfalls; raw data structure; data structure limitations; reporting; data analysis

A while back I was talking to a librarian who was having problems with the pivot tables I created. She insisted that the pivots were calculating incorrectly, because they showed that 13+4=17, when it should be 16. I told her jokingly, like a police officer would, that she needed to step away from the spreadsheet. Once the penny dropped for her we both laughed about it, and in this instance it was just one of those silly mistakes that we all make from time to time. Excel, however, can be problematic – and if someone is not data literate, they can use Excel to create many headaches.

Excel is incredibly powerful. Very few people understand its true power; and just as a high-powered muscle car can be devastating in the hands of a learner, so too can Excel become a problem if it is not driven correctly.

It is possible to use Excel as a tool to interrogate webpages, email, and Word documents, and to open, change, and delete files. I know, as I have written many programs using Excel VBA to do many things, such as web scraping, timetabling, conducting fuzzy matches, and writing complex algorithms to automate processes. Excel is much more than just a battery of cells into which you plug numbers and formulas. Excel also has many add-ins – such as PowerQuery and PowerPivot – that take the functionality of Excel to a completely new plane. PowerPivot can be used to manipulate huge datasets, and its functionality, from the point of view of likely library needs, brings it very close to that provided by data warehousing tools.

Most people who use Excel are only skating on a very thin surface of its true functionality, happily oblivious to the vast wealth of possibilities beneath them. Now, you do not need to become an Excel guru to manage your data more effectively. You may continue to be happily oblivious to its full power. You do, however, need to change the way you think about data. But this is not a big change. It is not like asking you to understand how time and space are relative, or to conceive of more than three physical dimensions. If you can read, then you can, with a small amount of discipline and an open mind, change the way you view data so that you are able to manage it much more effectively.

I have frequently found that when people see a lot of blank cells, and the option of adding an endless number of additional sheets, the first thing they do is treat Excel like a big sprawling canvas, onto which they can paint their data in the way that suits their individual tastes and needs. And every painting seems to be different, regardless of who paints it, or how many they have painted before. Each one is a special artwork, requiring art critics to help you to interpret the painting. This has to stop. The reason it needs to stop is also the same reason why almost all users default to using Excel as an art canvas.

Almost all library staff seem to have three considerations in mind when creating a new spreadsheet: I want to do this quickly, I want to make it easy for staff to enter data, and it needs to be easy for me to report on the data. By itself, this is a reasonable set of criteria. In practice, however, because staff tend to think of Excel the same way they think of tables in a Word document, they produce the least optimal data structures conceivable.

The first, and often only, response of library staff – and indeed most other staff I have worked with – when having to create a new spreadsheet is to enter their data into crosstab tables. For example, if staff needed to collect data on browsed items (items that clients have removed from the shelves, but not borrowed), then they might typically create a spreadsheet that contains dates in the columns (vertical), with the library locations in the rows (horizontal). They might even add more granular data to the row headers, specifying where in the library the book was browsed.

[Image: a typical crosstab spreadsheet, with dates in the columns and library locations in the rows]

The staff entering the data love this structure, because once they are familiar with it, there is minimal data entry, it is easy to see where to put the data, and it is easy to see what data you have already entered. And because librarians tend to love collecting data, they tend to find this exercise satisfying. The data is only rarely reported on, and when it is, it tends to be the bottom-line figure.

Because the spreadsheet creator tends to be focused mostly on making data entry simpler, and because they tend to think of Excel the way someone might look at a table in a Word document, they attempt to organize the data as best they can so that the whole table is visible on one screen. This typically means a new tab is created for each month, and perhaps a totals tab at the end that sums up all the months. Typically, an entirely new spreadsheet is created for each year.
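To see why the long, one-row-per-observation structure loses nothing, consider that the familiar crosstab view can always be regenerated from it on demand. Here is a minimal sketch of that idea in Python – the locations and figures are invented for illustration; in Excel itself, a pivot table over the raw data sheet does the same job:

```python
from collections import defaultdict

# Long-format raw data: one row per observation.
# (date, location, browsed count) - figures invented for illustration.
raw = [
    ("2015-03-02", "Main Library", 41),
    ("2015-03-02", "Law Library", 12),
    ("2015-03-03", "Main Library", 38),
    ("2015-03-03", "Law Library", 15),
]

# Regenerate the familiar crosstab view: locations down, dates across.
crosstab = defaultdict(dict)
for entry_date, location, count in raw:
    crosstab[location][entry_date] = count

for location in sorted(crosstab):
    print(location, crosstab[location])
```

The reverse is not true: once the observations have been smeared across crosstab tabs, one per month, recovering the raw data is painful.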

The problem with this approach is that too much energy has been focused on making it easy to put data into the bucket, at the expense of getting the data out of the bucket. This structure only enables you to easily retrieve the simplest of data views. Yet so many staff, in my experience, place 99% of their energies into making it easy to put the data in the bucket. I have said to so many staff: if you are never going to retrieve the data from the bucket, then why put it in there in the first place? You might as well spend a truckload of energy creating something, and then throw it into a black hole.

Rule number one should always be: most of your energy should be spent on making it easy to get the data out of the bucket. If the data cannot be used, the whole exercise will be a giant waste of time, no matter how easy you made the data entry.

Let's say management noticed a lot more clients in the library over the last few years, but noted that loans were declining over the same period. There was some pressure to free the space used by the stacks to create group study spaces. However, being a library, they did not want to reduce the stacks unless absolutely necessary. So management wanted to know if students were browsing instead of borrowing, and they wanted to know how this had trended over the last 7 years. To answer this question, a staff member would need to open each of the seven spreadsheets, copy the number from the totals tab, then paste that number as a value into a new spreadsheet.

Now, let's say you did this, then gave the data to management, and they went: OK, browsing has been declining overall in line with loans, but has this been the case for every one of our locations – and what is the ratio between loans and browsing at each location? So now you have to go back and open the seven browsing spreadsheets, copy the total for each location into a new spreadsheet, do the same with loans, then create formulas for the ratios. This is going to take you at least an hour – because there is so much manual copying and pasting, you want to be sure you have got it right, so you triple check all the figures.

Then you go back to management, they look at the data, and say, well it looks like we could clear some stacks at this location, but which level is being used the most? To answer this question you are going to have to open up all seven spreadsheets again, and then add formulas to them, because the totals formulas at the end of the spreadsheet only captured totals for each location, not the totals for the levels. You have to completely rebuild the totals tabs on all seven spreadsheets, double check that they are summing correctly, copy the values from the seven spreadsheets, and paste all those values into a new spreadsheet. This is likely to take 2–3 hours.

These are very simple questions being asked of the browsed data, and they should be answerable within 5 minutes. If anything more sophisticated was asked, such as a correlation between loans and browsed items to help model and forecast demand, then the data from all seven spreadsheets would need to be radically restructured, and this is likely to take a week. Once again, this is a question your spreadsheet should be able to answer in 5 minutes.
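Had the browsed data been kept in one long raw data table, each of management's questions would simply be a different summation over the same rows. A sketch of the idea in Python, with invented figures – in Excel, a single pivot table over one raw data sheet achieves the same thing:

```python
from collections import defaultdict

# One row per observation: (year, location, level, browsed count).
# Figures are invented for illustration.
rows = [
    (2014, "Main", "Level 1", 1200),
    (2014, "Main", "Level 2", 800),
    (2014, "Law", "Level 1", 300),
    (2015, "Main", "Level 1", 1100),
    (2015, "Main", "Level 2", 750),
    (2015, "Law", "Level 1", 280),
]

def total_by(key_index):
    """Sum browsed counts grouped by one column of the raw data."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[3]
    return dict(totals)

by_year = total_by(0)      # the 7-year trend
by_location = total_by(1)  # per-location totals
by_level = total_by(2)     # which level is used most

print(by_year, by_location, by_level)
```

Each follow-up question from management is answered by regrouping the same rows, not by reopening and rebuilding seven workbooks.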

So instead of being useful from a business point of view, the spreadsheets are aesthetically beautiful. They are beautiful because only a select few of the chosen special people can interpret them, and they are beautiful because they are unique, one-off, hand-crafted masterpieces, complete with the full array of gaudy colors available to the Excel artist. These spreadsheets provide a wonderful data entry experience too, and they make the users feel all warm and fuzzy. Look how busy we are. As special as these artworks are, however, they need to be taken off the wall, and put somewhere they will never be used again.

To succeed in standardizing your spreadsheets into a useful data structure, you will need to control who is allowed to create new spreadsheets, and specify which spreadsheets are to be used for which purpose. This may sound draconian, but your choices are simple: allow users to handcraft artworks that will only ever be useful for the most simplistic purposes, or control the creation and use of spreadsheets centrally. If you want to take control of your business's destiny, then you only have one option: centralized data management. This, of course, requires full executive support, as many staff are quite possessive of their spreadsheets, and may be reluctant to relinquish control, or to have them changed significantly.

Consequently, depending upon the size of your institution, and the complexity of the change, you may need to conduct some formal training when you roll out the new spreadsheets. Once again, the very first thing you should emphasize at the training, even if they have heard it a million times before, is why you are doing this, and what you expect to achieve. People are much more likely to come on board with you if they understand the reason for change, even if they do not agree with you.

The training issue can be taken too far though. For example, one of the first spreadsheets I started to restructure when I was running my project was the visits spreadsheets. This was and is an important statistic, as it helps the library to make informed decisions about things such as rosters and the timing of promotions. These spreadsheets, like all the others, were a complete mess. The gate counter statistics for the last 7 years were contained in about half a dozen spreadsheets, spread over just as many tabs, with each tab containing a different structure of data with up to 18 columns. I took this big bowl of spaghetti, and transformed it into a single spreadsheet, with one raw data tab containing four columns: Date, Location, Time, and Visits. I assumed I had made things simpler. Yet after I made these changes, one librarian made an angry phone call to me to say there had been all these changes to the gate statistics spreadsheets, and demanded training.

It was not one of my better days, so my response was not my finest hour. I told her that the old spreadsheets were almost indecipherable, the new one contains only four columns, and even though it is blindingly obvious how to use it, I would nevertheless be happy to provide her with the training right now over the phone. I said the first column is Date, that's the date the visits were made on, and it is calculated as the number of days since Christ was born, using a calendar that evolved from Roman days. I then said the next column is Location, and unless aliens had visited Earth and teleported your library to a new place, then your library will always be at the same location, so you will always enter the same thing for that column. I said the Time is the thing you will find on your watch. Finally, I said the number of Visits is the number of people who have entered your library on that particular day, and you read this off the gate counter the same way as you have always read it – the only difference is you now put it in a different spreadsheet. I said your training is now complete, a certificate will be in the mail shortly, and hung up.

Obviously, this was a terrible thing to say for many reasons, and it did come back to bite me quite a few times. Don't suffer fools, but make sure you do it in a productive way. It can be frustrating at times, particularly if you are trying to make sensible improvements in an environment that is resistant to change. You need to expect it will be frustrating occasionally, and manage yourself accordingly.

The new visits spreadsheet was resisted for two reasons. First, it involved change – a change that made data more visible. Anyone could now see data on visits with a couple of clicks, without needing some complex road map. There will always be a minority of staff who relish control, and they will not like the loss of control that these changes bring. Second, many people value the time they spend entering the data, but do not grasp the value that data could provide, so to these people such changes appear wasteful.

The ten table commandments

Getting your table structure correct is very important, and this is addressed in the next chapter. There are, however, some other classic problems that you need to recognize as problems before you start building something new. After all, you are reading this because you want to break the cycle of spreadsheets creating work for you, instead of saving you work. So below is a list of things that you should absolutely not do.

One – Don't spread your spreadsheets around directories like confetti. Don't spread the love! Staff should not need a road map to find statistics. If a spreadsheet is tucked away ten layers deep in a team folder, then finding it might be obvious to those who regularly use it, but to others it might as well be a mythical unicorn! Put all the spreadsheets your staff are using in one place, and if you are no longer using a spreadsheet, then archive it off somewhere – but that should be one place too.

Two – Do not use different structures for spreadsheets. Most spreadsheets will only require three visible sheets: a raw data sheet into which all the raw data is placed, a set of pivot tables, and a contents page linking users to those pivots. You should stick with these same three sheets, and use the same name and layout for all your workbooks. If your raw data headers start in row 10 in one workbook, then there is absolutely no reason why they should not start in row 10 for all workbooks. If your raw data sheet is called “RawData” in one workbook, then don’t call it something different in the next workbook. If you are going to use conditional highlighting to color code locations, then use the same color coding for all spreadsheets that contain the location field. Wherever you can be consistent, be consistent. Consistency makes it easier to maintain, and easier for staff to use. This last point is critical. If some of your staff are scared of Excel, then you will be making their life a lot easier. Once they learn how to use one of your spreadsheets, they will soon be able to use any of your spreadsheets.

Three – Do not use trendy, vague or vain names for spreadsheets. Try to base the name on what is being measured, not the team doing the activity, or even the name of the service. Many organizations put their teams through a revolving door of name changes, both for teams and their services. What is called the “Refuse Solutions Team” this week, might simply be called the cleaning team next week. So, if the spreadsheet is a cleaning roster, then call it that, not something soon to become indecipherable like “Refuse Solutions Team Future Work Scheduling Matrix.xlsx”. A reasonably intelligent staff member should be able to look at the name of your spreadsheet in 5 years’ time, and still know what to roughly expect without having to earnestly furrow their brows in thoughtful contemplation before opening the spreadsheet.

Four – Do not leave your spreadsheets unprotected. Many users are unconscious Excel vandals. If you give them the chance they will delete your formulas, move data, create new sheets, write new formulas (often incorrectly), change your headers, change old data, and generally wreak havoc. Chapter 6 will cover how to do this properly.

Five – Do not show users irrelevant data. Your users will not need to see many of the formulas in your raw data sheet. You might feel they make you look smart, but they will only annoy, or worse, confuse your users. If the user does not need to see the columns, hide them. If the user does not need to see a sheet (e.g., the data validation sheet), then hide the whole sheet. This will reduce the complexity of your workbook, making it easier to use, and this ease of use will help improve data integrity. For example, your staff will be less likely to procrastinate over entering their data for months on end. This leads to the next commandment…

Six – Do not leave data entry unmanaged. If users are left to their own devices, they will enter data whenever they feel like it. This might mean today they add an information literacy class they ran 2 years ago to the spreadsheet. The problem with this is that when you now run the report for the number of information literacy classes taught over the last 3 years, the numbers you get for those years will not be the same as the report you ran last year. It is very embarrassing having to go to the Director to explain why the numbers you provided for previous annual reports have to be changed in the current annual report. Depending upon your Director, you might receive a response ranging from rage to a resigned nod! This can all be avoided. Every quarter I lock down the statistics. This means I do two things. First, I lock all the cells for everything except the current quarter. That means no one can go back and retrospectively fiddle with numbers. Second, I change the data validation rules so that no one can enter a date prior to the current quarter. That way staff cannot do something like add an information literacy class from 2 years ago.
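The lockdown rule – no dates prior to the current quarter – is simple enough to express in any tool. Here is a sketch of the check in Python; the quarter arithmetic is my illustration of the idea, not the author's exact Excel validation rule:

```python
from datetime import date

def quarter_start(today):
    """Return the first day of the quarter containing `today`."""
    first_month = 3 * ((today.month - 1) // 3) + 1
    return date(today.year, first_month, 1)

def entry_allowed(entry_date, today):
    """A date may only be entered if it falls in the current quarter or later."""
    return entry_date >= quarter_start(today)

# A class run two years ago cannot be added retrospectively...
print(entry_allowed(date(2013, 5, 1), date(2015, 5, 20)))   # False
# ...but an event from earlier in the current quarter can.
print(entry_allowed(date(2015, 4, 2), date(2015, 5, 20)))   # True
```

In Excel, the same effect comes from a data validation rule whose minimum allowed date is the start of the current quarter, combined with locking the cells for all prior quarters.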

At first quite a few staff responded negatively to this new approach. They were used to being able to update the data as they saw fit, and they did not enjoy the more regimented approach. They would say to me after the lockdown period closed, often months after, "I forgot to add this, can you please add it now." Most times my answer would be a respectful "no." They of course responded that the data would be inaccurate, to which I would agree. I would say, however, that if I allowed staff to add data whenever they wished, then the data we report on in important documents, such as our annual report, would be even more inaccurate, because it would always be missing significant amounts of data that staff have not been required to enter. The few inaccuracies we get from people missing lockdowns are considerably less than the inaccuracies we would get without the lockdowns. I would then add that I know this is a harsh approach, but I am sure that staff will become more proactive in future about entering data before the lockdowns. And this is precisely what happened. Staff became used to the data lockdowns, and made sure they updated their data in time. If you want your annual report, or business plan progress reports, or any other reports to accurately reflect what you have done during a period, then you will simply have to be ruthless about enforcing lockdowns. The truth is, you are only asking for a modicum of discipline, and those staff that miss the first couple of lockdowns will soon get the picture. It is absolutely wonderful to be able to run statistical reports for a given period, and still get exactly the same number for that period 5 years later, and be confident that the data is accurate. This is how it should be, and how it can be.

Seven – Do not allow users to type whatever they feel like into cells. You will need to apply data validation rules to your spreadsheets for quality assurance reasons. Data validation means the user cannot type whatever they feel like in a cell, they have to enter a specific value from a list, enter a number within a range, or enter a date. Chapter 6 will cover how to use data validation.
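The three kinds of rule mentioned – a value from a list, a number within a range, a valid date – can be sketched as follows. The location names and the visits limit are invented for illustration; in Excel these would be data validation rules on the relevant columns:

```python
from datetime import datetime

# Allowed values would normally live on a hidden "Validation" sheet.
VALID_LOCATIONS = {"Main Library", "Law Library", "Medical Library"}

def validate_entry(location, visits, visit_date):
    """Mimic Excel data validation: list membership, number range, real date."""
    errors = []
    if location not in VALID_LOCATIONS:
        errors.append(f"unknown location: {location!r}")
    if not (0 <= visits <= 50000):
        errors.append(f"visits out of range: {visits}")
    try:
        datetime.strptime(visit_date, "%Y-%m-%d")
    except ValueError:
        errors.append(f"not a valid date: {visit_date!r}")
    return errors

print(validate_entry("Main Library", 412, "2015-05-20"))  # no errors
print(validate_entry("Main Libary", -3, "20/05/2015"))    # rejected
```

The point is not the code, but that every cell is checked against an explicit rule at entry time, rather than cleaned up (or not) at reporting time.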

Eight – Do not put your lookup tables in the raw data sheet. Store your lookup tables and validation lists in a new worksheet and call it "Validation." If you don't know what lookup tables are now, you will by the end of the next chapter.

There are many reasons why you would want to store your lookups and validation lists separately, and these will become more compelling as you progress through this book. First, having the lookup table on the raw data worksheet only adds clutter. 99.9% of the time, the only people looking at the raw data worksheet will be the staff doing the data entry, and the clutter will not help them to do their job more accurately and efficiently. Second, if you extend the number of columns in your raw data, then depending upon where you put your lookup table, you may find yourself running out of real estate. You can always move the lookup table, but why double handle when you can put it out of the way for good in the first place? Third, on some rare occasions you may need to delete rows of data. This can be done without interfering with existing tables, but it is safer not to have your lookups occupying the same rows as your raw data in the first place. Finally, it makes it easier to edit the list. If you have tens of thousands of rows of raw data, you may have to unhide some rows and columns, and then change the freeze panes settings, to be able to edit your lookup tables. If your lookups are on a different sheet to your raw data, this will never be a problem.
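The separation being argued for here mirrors how a lookup behaves in any tool: the raw data holds a short code, and the lookup table, kept elsewhere, expands it. A small sketch, with invented codes and names:

```python
# Lookup table - in the workbook this would live on its own "Validation"
# sheet, well away from the raw data (codes and names are invented).
LOCATION_LOOKUP = {
    "MAIN": "Main Library",
    "LAW": "Law Library",
    "MED": "Medical Library",
}

def expand_location(code):
    """The equivalent of a VLOOKUP against the separate lookup sheet."""
    return LOCATION_LOOKUP.get(code, "UNKNOWN")

print(expand_location("LAW"))  # Law Library
print(expand_location("XYZ"))  # UNKNOWN
```

Because the table lives apart from the raw data, editing it never risks disturbing the rows your staff are entering.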

Nine – Never ask a user to enter a value that can be automatically generated via a formula. If you can write a formula to do something, then write the formula. Not only will this improve data entry speed, but it will also make your data accurate. Whenever someone enters something into a spreadsheet, there is a chance they will enter it incorrectly. You need to design spreadsheets to minimize this risk, and one of the first steps to minimizing this risk is to use formulas wherever possible.
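For example, if staff enter a date, then the year, month, and weekday should never be typed by hand – they can all be derived from it. A sketch of the idea in Python; in Excel the equivalents would be formulas such as YEAR or TEXT:

```python
from datetime import date

def derived_columns(visit_date):
    """Columns users should never type by hand: derive them from the date."""
    return {
        "year": visit_date.year,
        "month": visit_date.strftime("%B"),
        "weekday": visit_date.strftime("%A"),
    }

print(derived_columns(date(2015, 5, 20)))
```

Every hand-typed "May" that should have been "March" disappears from your data the moment the value becomes a formula.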

Ten – Make people responsible for their own mess. If staff refuse to stop creating their own spreadsheets, then make them responsible for reporting, demand that they be able to do it to the depth and speed that you will be able to with your standard spreadsheets, and ensure they wear full responsibility for the data loss and other inaccuracies that will inevitably occur with poorly designed spreadsheets used by multiple staff. There are not many staff who would be willing to saddle up for that responsibility, and if they are, wonderful – it's no longer your problem, it's theirs!
