Excel 2019 Bible

(singke) #1

Part V: Understanding Power Pivot and Power Query


you produce. It is for this reason that finding and removing duplicate records should be
your first priority when you receive a new data set.

Before you jump into your data set to find and remove duplicate records, it’s important
to consider how you define a duplicate record. To demonstrate this point, look at the
table shown in Figure 39.1, where you see 11 records. Out of the 11 records, how many are
duplicates?

FIGURE 39.1
Are there duplicate records in this table? It depends on how you define one.

If you were to define a duplicate record in Figure 39.1 as a duplication of just the SicCode,
you would find 10 duplicate records. That is, out of the 11 records shown, 1 record has a
unique SicCode, while the other 10 are duplications. Now, if you were to expand your defi-
nition of a duplicate record to a duplication of both SicCode and PostalCode, you would find
only two duplicates: the duplication of PostalCodes 77032 and 77040. Finally, if you were to
define a duplicate record as a duplication of the unique value of SicCode, PostalCode, and
CompanyNumber, you would find no duplicates.

This example shows that having two records with the same value in a column does not nec-
essarily mean you have a duplicate record. It’s up to you to determine which field or combi-
nation of fields will best define a unique record in your data set.

Once you have a clear idea of what field, or fields, best make up a unique record in your
table, you can remove duplicates easily by using the Remove Duplicates command.

Figure 39.2 illustrates the removal of duplicate rows based on three columns. Note the
importance of selecting the columns that define a duplicate. In this case, the combination
of Address, CompanyNumber, and CompanyName defines a duplicate record. These columns
are selected before right-clicking and selecting the Remove Duplicates command.

The Remove Duplicates command essentially looks for distinct values in the columns that you selected, and then it
removes all of the records it needs to eliminate to end up with a unique list of values. If you select only one column
before initiating the Remove Duplicates command, Power Query will use only the one column that you selected to
determine the unique list of values. This will undoubtedly remove too many records—records that are not really dupli-
cates. For this reason, it’s important to make sure you select all of the columns that define a duplicate.
Free download pdf