© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. I. KatzUp Up and Array!https://doi.org/10.1007/978-1-4842-8966-2_7

7. The FILTER Function: Cutting the Data Down to Size

Abbott Ira Katz1  
(1)
Edgware, UK
 

Your dataset may be an embarrassment of riches, but sometimes more is less. Every spreadsheet user knows that their data often warehouses more information than they need right now, and could do with a bit of incisive, if temporary, trimming.

And Excel’s new dynamic array FILTER function is very much up to that task. And as with SORT and SORTBY, FILTER contributes a set of formula-centered solutions to a classic spreadsheet chore that’s been handled heretofore by a potpourri of tools, for example, pivot table filters and slicers, the Advanced Filter option, and the standard-issue Filter you see atop tables and the like.

They all have their place, of course, but because FILTER does its thing with formulas, a vast catalog of criteria by which the data can be summoned becomes available, making FILTER a most powerful and exceedingly agile means for getting what you want from the data. FILTER is the dynamic array’s Swiss army knife – it can slice and dice the data in so many ways.

The Formula

FILTER comprises three arguments, the final one of which is optional (but useful):
  • =FILTER(range/array,include,if_empty)

The first argument, again officially dubbed “array” by Excel, calls for the dataset you want to filter, and of course accepts range or table names in lieu of cell coordinates.

It’s the second argument – “include” – from which FILTER’s power radiates. Start with an elementary filter (file: FILTER, adapted from fellow Apress author Debra Dalgliesh’s Contexture compendium of free downloads, Again, the dataset in its entirety is called All).

We want to filter the food-sales dataset for all sales conducted in Boston. Enter Boston in L1 and follow in L3 with
  • =FILTER(All,City=L1)

You should see as follows (again, in excerpt) per Figure 7-1.

A screenshot of an Excel sheet has 8 columns and 18 rows. The second cell of the first column is highlighted.

Figure 7-1

Bean counting: food sales in Beantown

(Again, as with SORT, the resulting spill range evinces basic, default formatting that you’d almost surely want to modify; and again, the results are header-free. That row could be copied and pasted atop the spill range.)

We’ve successfully isolated all the sales conducted in Boston, and with that achievement in hand let’s return to FILTER’s syntax. Here the “include” argument
  • City=L1

cites the name of the field doing the filtering and asks the dataset to search for whatever city name we’ve written to L1. You’ve doubtless inferred, then, that if we substitute New York for Boston in L1, all of the former city’s sales will be returned by the filter. And you could probably deduce in turn that if we were to enter Cookies in L1, we’d need to rewrite our formula in L4 to read
  • =FILTER(All,Category=L1)

because Cookies is an item populating that field.

These basic FILTER constructions are pretty straightforward but rather powerful just the same – because you can conduct rapid-fire analyses of the sales in any city or activity for any food category by simply substituting the desired item in L1. And notice that while our formulas have imposed a condition of sorts on the data – for example, filter a record only if its city matches Boston – FILTER criteria never require use of the word IF. You only enter the condition(s) itself – what IF formulas call a logical test – and nothing more. Thus FILTERs never ask you to post a criterion – that is, an “include” argument – that begins
  • =IF(Category=L1…

Now About That [If Empty] Argument

Before we explore more of FILTER’s most important criteria variations, let’s not forget the function’s optional If Empty argument. If Empty plays a cosmetic role in the filtering enterprise, letting the user supply an info notification to the formula, in the event the filter turns up nothing. Thus, if I enter Tacos in L1 – an item nowhere to be found in the Category field – FILTER will default to an unsightly #CALC! exclamation. But if I edit FILTER to read
  • =FILTER(All,Category=L1,”Not found”)

That caption will replace the error message.

Multiple Filter Criteria: Getting More (or Less) Out of the Data

Now suppose we want to filter the data for Cookies sales in San Diego. That’s an objective, of course, that requires two criteria – one each drawn from the Category and City fields. With that two-pronged intention in mind we can now enter San Diego in M1, and go on to edit our filter formula to read
  • =FILTER(All,(Category=L1)*(City=M1),“Not found”)

The formula has assumed two new elements, both of which are integral to a multi-criteria filter. First, once a second (or third, etc.) criterion enters the formula, each criterion must be surrounded by its own pair of parentheses. Second, recognize that we’ve in effect written an AND statement. That is, we’re insisting that the formula nail down all the records in which the Category field reports the entry Cookie and the City field lists San Diego. The means for stipulating “and” in a filter formula is the asterisk, wedged in between the criteria. And so if you needed to import a third criteria, the syntax would look like this:
  • (1st criterion)*(2nd criterion)*(3rd criterion)

(In fact, the asterisk – Excel’s multiplier operator – is actually performing an act of multiplication inside the formula, too, but we can skim past the details of that maneuver here.)

Once we’ve gotten all that punctuation in place, we should see the following, per Figure 7-2.

A screenshot of an Excel sheet has 8 columns and 18 rows. The second cell of the first column is highlighted.

Figure 7-2

San Diego snacking: a two-criteria filter

OR Consider This Alternative

Now let’s assume instead that you want to filter all the food sales for both San Diego and Los Angeles, your west coast accounts. That is, we’re seeking records for sales conducted in either San Diego or Los Angeles. If we enter Los Angeles in L1 to accompany the San Diego in M1, and amend our formula to
  • =FILTER(All,(City=L1)+(City=M1),“Not found”)

You should see the following, as in Figure 7-3.

A screenshot of an Excel sheet has 8 columns and 18 rows. The second cell of the first column is highlighted.

Figure 7-3

California comestibles: food sales in Los Angeles and San Diego

What clearly distinguishes the multi-criteria “or” filter from an “and” is a swap of the asterisk for the plus sign – nothing more.

Mix and Match: AND and OR Together

Next question: what if you want to filter for all the Cookies sales in both Los Angeles and San Diego? That’s a tasty problem, to be sure, requisitioning as it does both “and” as well as “or” criteria. Here, I want the filter to detect sales compiled in either Los Angeles or San Diego – but only of Cookies. How is this one written?

Like this:
  • =FILTER(All,((City=L1)+(City=M1))*(Category=N1),“Not found”)

An extra brace of parentheses now hugs both halves of the “or” expression, treating it all as a unitary result that proceeds to work with the “and” statement. If you inadvertently forget the second parentheses around the two City references, the formula will first consider criteria 1 and 2 alone, and then criteria 2 and 3 alone, thus mining outcomes that won’t necessarily fulfill all the criteria at the same time (but if you’ve been clicking along, you’ll note that all the San Diego entries will be coupled with Cookies, for example,
  • (City=M1))*(Category=N1)

because those pairings always call upon criteria 2 and 3, as they’re adjacent. But the Los Angeles records aren’t invariably matched with Cookies, because that city occupies criterion 1, and isn’t directly associated with criterion 3, the Category reference.)

In any case, once you put every bit and piece in its place Figure 7-4 lets you know what you should be seeing the following.

A screenshot of an Excel sheet has 8 columns and 18 rows. The second cell of the first column is highlighted.

Figure 7-4

Cal. Counting: Cookies sales in Los Angeles and San Diego

Now let’s return to our inaugural FILTER formula, the one in which we returned the sales data for Boston. By definition, then, all our results there must disclose that city’s name in the City field – and precisely because that field will contain nothing but Boston, we don’t really need all that redundant information. As a result, and for good presentational reasons, we might want to eject the City field from this filter – and in order to make good on that intention we can reprise the INDEX formula we dusted off in Chapter 5, in the course of our perusal of the UNIQUE function.

Here, our formula reads

=FILTER(INDEX(All,SEQUENCE(ROWS(All)),{1,2,4,5,6,7,8}),City=M1,”Not found”)

Don’t be fooled or confused; in spite of its density, the filter here consists of the same three arguments with which we’ve been working all along. The INDEX formula is assembling an improvised dataset, one appropriating the All minus the City field – the third column in All, thus explaining the missing 3 in the bracketed array. Note in addition the SEQUENCE function here that’s performing the same work it did in the UNIQUE formula – it’s counting, and hence including, all of the rows in the All dataset.

And once you’re satisfied with your formula, the results should follow corroborated by Figure 7-5.

A screenshot of an Excel sheet has 7 columns and 18 rows. The second cell of the first column is highlighted.

Figure 7-5

Something’s missing – on purpose. The City field isn’t there.

All those Bostons are gone; and for good measure, we could have zapped the Region field, too – the dataset’s second field – from the formula, by excising the 2 from the bracketed array – because, after all, Boston is always in the East.

Note

Our discussion of the new dynamic array function DROP in a later chapter will describe a considerably more elegant way in which this result can be achieved

How About Filtering Part of a Cell?

Thus far, we’ve learned about filtering fields whose data consists of single words, for example, Cookies, or Boston. But what about cells consisting of what’s typically called unstructured text – that is, multi-word content taking up residence in a single cell address? Can FILTER help us to unearth a keyword, or selected words from the larger text sitting in a cell?

Well, the answer is a decisive yes. FILTER can masterfully extract partial text excerpts from a range of cells, and for a case in point break open the FILTER – text search file, an adaptation of a spreadsheet that enumerates the current members of the House of Representatives (source: Everypolitican (https://everypolitician.org/united-states-of-america/house/download.html)).

Suppose for starters we want to filter the name field for all Representatives named John. The problem, of course, is that all the names are bunched into single cells – first, last, and middle names – and we need to somehow uproot “John” from the lengthier entries into which the name has been gathered.

To give that assignment a go we’ll tap into the SEARCH function, an old but robust means for discovering the morsels of text cooking in a larger stew of words. SEARCH is what I call a positional function, because it delivers a number corresponding to the position in a cell at which the searched text begins, assuming it’s found. Thus if I search for the word Thursday in the cell entry “Today is Thursday” SEARCH will offer up 10, the position number of the T in Thursday. It’s written as follows:
  • =SEARCH(find_text,within_text,start_number)

The start number argument is optional.

Thus if “Today is Thursday” is written to A6 and we enter Thursday in A2, this formula
  • =SEARCH(A2,A6)

will pump out 10. If the searched text isn’t there, a #VALUE! error message will stake the cell instead. Remember, of course, that we’re about to apply SEARCH to an entire range of names, but we’re in dynamic array mode, and that’s something we can do.

Now back to our task. In the FILTER tab of the FILTER – text search workbook enter John in B2. In B4, write
  • =FILTER(name,ISNUMBER(SEARCH(B2,name)),“Not found”)

The ISNUMBER function wrapping itself around SEARCH tests a value or formula for its numeric status; thus if a search of each cell in the name range yields a number – meaning that the search term has been found – that cell will instate itself into the filtered result.

Our FILTER thus looks like this, per Figure 7-6.

A screenshot of an Excel sheet has 4 columns and 21 rows. The entries are in the first column and the third cell of the first column is highlighted.

Figure 7-6

Restricted entry: Congresspersons with John in their name – somewhere in their name

That works – sort of. But we see that our filter has garnered a trio of Johnsons – which also contain the text John – that we probably didn’t have in mind. A workaround: enter in B2 instead

John[space]

That search term will close the door on the Johnsons, simply because the John in their names is succeeded by a character other than [space]. That second take will yield as follows, illustrated in Figure 7-7.

A screenshot of an Excel sheet has 4 columns and 16 rows. The entries are in the first column and the third cell of the first column is highlighted.

Figure 7-7

That’s more like it: nothing but John

Thus we see that FILTER does a superior job of performing text/keyword searches of free, or unstructured, text – a most flexible and dynamic capability.

Note

Some approaches to dividing last and first names into distinct cells will be explored when we review Excel’s newest dynamic array functions TEXTBEFORE and TEXTAFTER.

There’s More

In fact, FILTER’s potential is in large measure curtailed only by the smarts you bring to your formulas. Another example, just for demo’s sake: suppose we wanted to be able to return all the names of Congresspersons starting with a particular letter. Say now I enter B in B2 and rewrite B4’s formula to read
  • =FILTER(name,LEFT(name,1)=B2)

Here the LEFT function peels the first letter from each name, and FILTER takes over and winnows the ones starting with the initial you’ve entered in B2, recorded in Figure 7-8.

A screenshot of an Excel sheet has 4 columns and 20 rows. The entries are in the first column and the third cell of the first column is highlighted.

Figure 7-8

Initial success

And remember, in addition to many other possibilities, I could write in another cell
  • =COUNTA(B4#)

to deliver a total of members of Congress with names beginning with the designated letter.

FILTER’s Number Wizardry Adds Values to Your Values

And of course, FILTER is at your service should you need to count the numbers. By way of a simple example, let’s call up a familiar-looking dataset, here stored in the FILTER: Values file.

We want, at least by way of a first look, to filter all order amounts there equalling or exceeding $2,000. Let’s enter 2000 in cell H3 and follow in H5 with
  • =FILTER(All,Order_Amount>=H3)

That expression should be pretty clear. We’re attempting to filter the entire dataset All by order amounts that equal or surpass the value we’ve entered in H3. Working with 2000 you should see the following, per Figure 7-9.

A screenshot of an Excel sheet has 5 columns and 14 rows. The third cell of the first column is highlighted.

Figure 7-9

Making the number: sales equalling or bettering $2000

(Remember that the greater and less-than operators always precede the = sign in Excel formulas.)

Now for another value-driven task, what if you need to study all the sales conducted, say, between January 1, 2020, and June 30 of that year – that is, a 6-month summary?

Ok. This time enter 1/1/2020 in H3 and follow in I3 with 6/30/2020. Write, in H5
  • =FILTER(all,(Order_Date>=H3)*(Order_Date<=I3))

Again, this is an AND formula, because the filtered dates must satisfy both criteria. And as such, each criterion need be flanked by parentheses and separated by the asterisk befitting “and” filters. Write that formula and you should see the following (I’ve formatted the Order_Date in order to clarify the results here) (viewed in excerpt, in Figure 7-10).

A screenshot of an Excel sheet has 5 columns and 19 rows. The second cell of the first column is highlighted.

Figure 7-10

Half a year’s worth of sales

Note

You can also enter date criteria via this slightly more ponderous method, for example, =FILTER(All,Order_Date=VALUE(“6/10/2013”))

No More Blank Looks About Blank Rows

If you’re looking for a data-quality conundrum that can make the day of any spreadsheet analyst big time, start with blank rows. These nullities are the bane of conventional sorting, pivot tables, and traditional filters, but once you get with the FILTER program you can actually do real work – even if you leave those blanks in place.

By way of example, here’s a modification of a spreadsheet (File: Filter – working with blank rows) released by a New York City agency (for the original file go here https://media.githubusercontent.com/media/thecityny/housing-lottery-data/master/housing-lotteries.csv)), one that itemizes public housing apartments offered up via a lottery. Its blank rows as you see them, however, came that way in the original sheet. So what can we do? (As usual, the dataset is entitled All, and the field names derive from the header row.)

Here’s what. Suppose we want to filter the Rent field for all properties priced at $800 a month or less. Turn to the Filter worksheet (note that in the interests of clarity I’ve proactively pasted the headers in row 6, to point to the Rent field). Enter, in B7 and as embodied by Figure 7-11.
  • =FILTER(All,Rent<=B3)

A screenshot of an Excel sheet has 21 rows and 16 columns. The fifth cell from the first column is highlighted.

Figure 7-11

Probably not what you had in mind: Rent-free properties?

And we get the following, in excerpt, in Figure 7-11.

Now that doesn’t look right, does it? But the problem is that the filter was constructed correctly; it has properly turned up all the rents that meet our criterion – equal to or less than $800. But remember that every other row in our data set is blank, and as such their Rent field naturally evaluates to 0 – and 0 is less than 800. What we want, then, is a formulaic take two, something like this:
  • =FILTER(All,(Rent<=B3)*(Rent<>0))

yielding as follows, in Figure 7-12.

A screenshot of an Excel sheet has 19 rows and 16 columns. The third cell from the first column is highlighted.

Figure 7-12

That’s more like it

Coming Next

Sure, filtering a dataset half of whose rows contain absolutely nothing in them sounds a touch gimmicky, but you get the idea – and besides, we’ve been working here with an actual spreadsheet. The FILTER function can crunch a subset of your data that meets just about any specification, and the more specifications you can imagine, the more powerful the function. Next up is a very different dynamic array function that performs a very different set of tasks – RANDARRAY.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.222.95.7