ABSTRACT

A data set often contains duplicate values across observations. Individual records can be duplicated or a subset of variables among records can have identical values. One way to eliminate duplicate records from a data set is to use FIRST.VARIABLE and LAST.VARIABLE within the DATA step, which was illustrated in Section 4.2.2. If you don’t want to alter the composition of your original data set, however, you can use the NODUPKEY or the NODUPRECS options in the SORT procedure and send the output to a second data set that is referenced in the OUT = option. The form for PROC SORT with these options is as follows:

10.1.1 Eliminating Observations with Duplicate BY Values

Consider the data set DUPLICATES, which contains three variables: ID, GRADE, and SCORE. Depending upon the BY variable being examined, the number of observations with the same BY values varies. For example, if ID is the BY variable, there are four observations (observations 1, 2, 4, and 6) with identical ‘A01’ values and three with identical ‘A02’ values. DUPLICATES:

When the NODUPKEY option in PROC SORT is used, observations with duplicate BY values are eliminated. PROC SORT sorts first and then compares BY values for a given observation to what is recorded in the previous observation. When a match is found, the current observation is not written to the output data set. Program 10.1 illustrates the different results by using different combinations of BY variables in PROC SORT. In the last PROC SORT, the keyword _ALL_ is used to sort all the variables in the input data set.