What do you do when you have a blog entry to write, but the seventh game of the World Series is on?
You make a choice. In this case, since the blog is not time sensitive, I elected to watch the game. I’ll be back with a post next week.
Congratulations Nationals on becoming the second Washington franchise to win a World Series!
In this post I’m going to show you how to work with text files and turn them into something useable within Excel. Sometimes you may have a report that you want to work with that’s in text format instead of Excel. Plus, it’s designed to be a report with many pages with column headings on each page. Trying to work with this in Excel can be challenging. I’m going to show you how I’ve worked with these in the past.
I’m going to use some data I pulled from a government site. This file is a text file listing of weather stations. Here’s the file if you would like to follow along.
To open this file in Excel you will need to use the Open, Browse command then navigate to the file’s folder. In the lower right of the Browse box, click on the dropdown for file type. The default on mine says “All Excel Files”. Change this to all files. You should now be able to see the text file, click on that to open the file.
Excel will bring you into the Text Import Wizard. In the first step make sure the original data type is Fixed Width and leave the other options the same. Then click “Next >”. In step two we’ll be setting the places where Excel will start a new column. It’s best to use the scroll bar on the right to get down to the actual data you’re going to be working with. In this case move down until you can see “Alaska” at the far left and then data below it. At this point you would look at the data and put in column breaks to separate the various data fields. I put my breaks in at the following positions: 3, 20, 26, 32, 39, 47, 55, 62, 65, 68, 71, 74, 77, 79, and 81. Then click “Next >”. (If you do this for another spreadsheet, make sure to write down these numbers, because there are cases where you need to recreate what you did.) For this spreadsheet we don’t need to do anything for step 3, so you can click on “Finish”.
In the spreadsheet that opens scroll down to row 41. You will see Alaska spread over two columns followed by the data with column headings. To make this easier to work with let’s get rid of the rows above this that we don’t need. Delete rows 1 through 40. Now we have the data we want to work with. First of all, we’re going to put in the missing column headings in column O and P. In cell O2 type in “Priority” and in cell P2 type in “CNTRY CODE”.
Most people don’t know the two-letter code for all fifty states, much less the Canadian provinces and other countries that are use later. Let’s add a column where we can pull in the heading with that data into each row. First insert a column at the beginning of the spreadsheet. In cell A2 type in “Location”.
Now in cell A3 we need a formula that will return “Alaska” or whatever names are used later on. You can see in row one Alaska is spread between two columns. So, well concatenate (meaning put them together) them. In cell A3 put in the formula B1&C1. This will put the text from these two cells together. But you’ll notice that if you increase the width of column A, that there is also a 1 in that text. We don’t want that; it belongs with the date in the next couple of columns. Counting the number of spaces, we find that come before the number gives us 16. We want a formula that will concatenate cell B2 and first 16 characters of C2. The function LEFT can be used to give us the formula to do that. We’re also going to want to get rid of the extra spaces after the end of the name but before the number. The function TRIM will do this. Change the formula in cell A2 to TRIM(B1&LEFT(C1,16)). *[I’m not going to explain how the formula works as you can get that information from your Excel help.] That gives a result of “ALASKA”.
Now to extend this to cover the whole spreadsheet. We can’t just copy the formula down all the way to the end. That would only work for the first row of each location. We could use a formula to pick up the result from the cell above each cell (For example cell A4 would have a formula =A3). But that wouldn’t work when the location changes. What we really need is a formula that will either pick up the result from the cell above or if the location changes, use the formula we put into cell A3. If you’re thinking an IF function might work, you are correct. But first we need to figure out how to tell when the location changes.
Scroll down to the spot where Alaska changes to Alabama, row 211. This way we can see what happens when the location changes and figure out how to use that in our formula. I’m looking at the columns to the left of the location and date and seeing that there empty. Let’s use that. Return to the top. In cell A4 we’ll use the following formula: =IF(H2=””,TRIM(B2&LEFT(C2,16)),A3). This formula should accomplish what we wanted to do.
Unfortunately, upon examining the results, there are some stations where the location is blank. Investigating shows that the locations in the US, Canada and Mexico all have column headings while the others that are all further down in the spreadsheet do not. Looks like we need to adjust our formula. For locations that do not have a column heading we want to adjust the formulas to refer to a different row. Setting up a data filter on the spreadsheet, we can filter on one of the column heading titles and see that the last column heading is in row 4653. We can add another IF function to do this. So the new formula for cell A4 would be: IF(ROW()>4653,IF(H3="",TRIM(B3&LEFT(C3,16)),A3),IF(H2="",TRIM(B2&LEFT(C2,16)),A3)).
Copy this formula to the end of the data. You can page through the spreadsheet to verify the formula has worked. Use save as and save the spreadsheet as an excel file. Just in case you won’t have to reenter the formulas if there is a problem later. Also, if the file you worked with changes, such as a report that covers a different month, you can just copy the formulas rather than recreating everything.
Now we need to eliminate the rows that do not contain data. This is easy to do. Because we’re going to eliminate rows this will mess up the results of our formulas in column A. We can change those formulas to data, so we don’t have to worry about that problem. Highlight all the data in column A and copy it. Now while that data is still highlighted do a paste, special, values to change everything to data instead of formulas.
Next delete row 1. Now the column headings are the first row. The next step is going to involve sorting and to make sure we can get the data back in its original order easily we can add a column with the current order. In cell R1 type in “Row”. In cell R2 put in the formula =ROW(). Copy that formula to the end of the spreadsheet. Then copy and paste special values to change the formula to numbers.
Next click on Sort. In the popup box make sure the box next to “My data has headers” is checked. Then set the sort by to CNTRY CODE and sort A to Z. Moving to the end of the country codes you will see all the rows after that are the ones we want to eliminate. You can delete all these rows to the end of the spreadsheet. Then if the spreadsheet is resorted A to Z for column R we have all the data in its original order.
I wanted to show you this so you could get the general concepts. Sometimes data isn’t in an easy to use format. But if you can convert it to Excel you have something that is easier to work with.
Trees and Forest Challenge Answers
All requests that mention Trump: Find these by clicking on the drop down for Request Description and typing trump into the search box above the list at the bottom. The resulting filter should show 4 items. Tracking numbers FY19-25,31,37 and 45.
The quickest responses: Find these by first clicking on the drop down for Final Reply Date and in the list at the bottom uncheck (Blanks). This will remove all of the large negative number items from Days. Then click on the drop down for Days. In the list at the bottom uncheck Select All then click on zero to select that one. The resulting filter should show 4 items. Tracking numbers FY19-03, 57, 58, 59 and 82. My guess is that these took zero days because there was nothing to report for these items.
The final reply date entry error: This one was a little trickier to find. When you looked for the quickest responses you might have noticed that after eliminating items that didn’t have a response date there was still one negative item in Days. Selecting that pulls up Tracking Number FY19-60. The final reply date is 3/12/18 which is nearly a year prior to the date received. This probably had the year entered wrong and should have been 3/12/19.
[Note: Apologies if you visited on Wednesday looking for a post. I usually write in the evenings and we had a power outage last night. My posts will be up by Thursday mornings.]
What’s that old saying about not being able to see the forest and the trees, or is it the trees and the forest? OK, Google tells me it’s “see the forest for the trees” with the first known use being in 1546.
In this post I’d like to show you how to take a forest of Excel data and pull out the trees that interest you. You might have a large spreadsheet with a lot of data that you’ve summarized and graphed to show the forest. But sometimes you’ll need to look at specific trees where there’s a part of the forest that’s not in line with expectations. That’s where data filters can help you.
First, we need a spreadsheet to work with. I like to work with real data; it’s a lot easier than trying to develop sample data. So, I have used the Advanced Search function in Google to find a spreadsheet that we can use for practice. I searched for Excel files that were published on websites the had a domain of “.gov”. I used the dot gov filter to make sure I was getting something in the public domain. I found a spreadsheet of freedom of information act (FOIA) requests to the US Trade Representative. You could do the same. Or you could just use my copy here:
Let’s open the file and take a look at it. We’ll work with just the first tab that is named Intake. Let’s set up the data filters so that we’re able to pull out just certain records to view. This is the easy part. First click on any cell within the data that you can see. Then click the filter icon. If you put this on your quick access toolbar (as I discussed here: https://www.tkanecpa.com/blog/excels-quick-access-toolbar) all you have to do is click on the icon that looks like a funnel (pictured on the left).
Or if you haven’t set this up on your toolbar you can find it on the Data ribbon. Alternatively, because there’s usually multiple ways to do things in Excel, on your keyboard you can simultaneously press the CTRL, Shift and “L”. Excel automatically figures out the top row for your data and how and the data to include. Dropdown arrows appear on the headings in row 3. Now you have filters set up.
Let’s see who’s been requesting data. Click on the dropdown arrow in column D, Requester Organization Name. You’ll see a menu followed by a list appear. We’ll try looking at just one item. In the list at the bottom click on the check box next to Select All. This will remove the checks from all the boxes. Now click on the check box next to the first item and we should see only the AFL-CIO’s requests.
But that didn’t work. You will see the AFL-CIO data request, but there’s a purple row followed by all these other requests. What’s going on? Remember back when you clicked the filter icon and I said Excel would automatically figure out what data to include. It turns out that the method it uses to figure that is to assume that an empty row or column is the end of the data. So, if there’s an empty row in the data (even if it’s colored purple) it assumes the data ended with the row above that. In order to fix this, we’ll need to turn the filters off. Just do the same steps again that you used to turn to filters on and they will turn off.
There are a couple ways to correct this. The first would be to just delete the blank row. Which in this case is fine, because looking at the data I can see that there is only the one blank row. That’s what I’m going to do as we work with this data. So, go to row 15 and remove it. (Right click on any purple cell and pick Delete from the menu that pops up. On the next menu pick Entire Row and then OK. The row is deleted.) You can now turn the filters back on.
The other method is better if you’re dealing with a lot of data and don’t want to find all the empty rows or columns and delete them. You will need to choose all the data from the top to bottom and side to side. With this spreadsheet you would start at with the column headings at cell A3. From there highlight the range A3 through G122. With that area highlighted turn the filters on.
The filters will now include all the data. Now filtering for just the AFL-CIO will show just their one request. You will also notice that there are a lot more Requesters to choose from when you look at the list.
Now that we have our filters, let’s work with this data a little bit. Let’s say we want to look at how long it takes to get a response from a FOIA request and be able to filter that.
We’ll start by setting up the formula to calculate how many days it takes. Let’s insert a column for that data. Move your cursor up to the top of the spreadsheet to the column headings and put it on the “G”. Right click your mouse and on the menu that pops up click insert. A new column appears, and it is already filtered. (We could have just put our formulas in column H, then we would have had to turn the filter off and on again to pick up the new column.)
Move to cell G3 and give the column a title. I’m calling it “Days”. Now move down one cell to G4. We’ll just do a simple formula to give us the difference between the “Final Reply Date” and the “Date Received”. In cell G4 enter the formula =H4-B4. The date 12/19/1905 appears. This is due to the way Excel treats dates, if we convert this to a number it will give us the days between the two dates. I just click on the comma icon on my quick access toolbar to do this and it becomes 2,180.00.
Now we can extend the formula to all the data. I do this by picking cell H4. It now has a border around it that includes a small filled in square in the lower right-hand corner. Holding the mouse over the border it will change to a black cross with arrowheads on each direction. Moving it over the small square the arrowheads will disappear. Double clicking at this point will extend the formula to the end of the data. We now have our forest of data, can we use filters to look at some trees?
Now if you left click on the G in the column heading, you’ll see some stats – average, sum and count of the data in column G.
But how can the average number of days be negative? Look at row 5, the AFL-CIO again. This request doesn’t have a final date so the amount in column G is a big negative number. We can correct that. Leaving column G highlighted, click on the dropdown for column H, then on the list at the bottom unclick the check mark in the box marked (Blanks). This will filter out the requests that have no final date. Now the average response time changes to 81.56 days.
Now, try clicking on the dropdown for column G. Instead of filtering on the list of numbers, let’s try something different. Just above the list you’ll see the menu choice “Number Filters”, click on that. You’ll see some choices that you can use to look for some data. Click on Top 10, then click “OK” on the box that appears. Only the ten requests that took the longest time to finalized will display and the average time for these is 460 days.
Let’s try something else. Start with nothing filtered. So, click on the dropdowns for columns G & H then click on the “Clear Filter” menu choice. You’ll notice that when the data is filtered that the row numbers on the left of the spreadsheet change to blue. When your filters are cleared, they will change back to black. Now click on the dropdown for “Request Description”. Just above the list of items there is a search box. You can type terms into that box to use as a filter. Try typing NAFTA into that box. There should be three items that pop up.
You can see from playing around with this that filters can be a powerful tool to use when looking at data.
Now a challenge for you. Using this spreadsheet find the following:
All requests that mention “Trump”.
The five requests that received the quickest responses.
The final reply date that was probably an entry error.
I’ll post the answers in my next entry.