Discussion:
SPSS equivalent of bysort and _n in Stata
(too old to reply)
t***@gmail.com
2016-04-15 13:35:02 UTC
Permalink
Hi all

I'm new to SPSS and struggling with syntax to drop duplicate cases. The duplicates are not the same in all variables - they have been double entered but on different dates. I want to exclude all but the first record (by date of entry) for each patient. In Stata I would do this using the following syntax:


/* sort by the identifying variables (PatientID for simplicity) and by date of data entry */
sort PatientID DateOfDataEntry

/* drop all duplicate records of PatientID except the first one */
by PatientID: drop if _n > 1


Is there a way of doing this in SPSS?

Thanks in advance for your help.

Tristan
Bruce Weaver
2016-04-15 18:00:23 UTC
Permalink
Post by t***@gmail.com
Hi all
/* sort by the identifying variables (PatientID for simplicity) and by date of data entry */
sort PatientID DateOfDataEntry
/* drop all duplicate records of PatientID except the first one */
by PatientID: drop if _n > 1
Is there a way of doing this in SPSS?
Thanks in advance for your help.
Tristan
I am occasional enough user of Stata that I recognize that _n system variable. Both it and _N look very useful! (People unfamiliar with Stata can read about them here: http://www.ats.ucla.edu/stat/stata/notes/countn.htm.)

Meanwhile, I think something like the following should do the trick for you in SPSS.

* Sort by the identifying variables and by date of data entry.
SORT CASES BY PatientID DateOfDataEntry.

* Drop all duplicate records of PatientID except the first one.
* Use system variable $CASENUM to identify the first case in the file.
SELECT IF ($CASENUM EQ 1) or (PatientID NE LAG(PatientID)).
EXECUTE.

Testing on a small dataset:

DATA LIST list / PatientID (F5.0) DateOfDataEntry (DATE11).
BEGIN DATA
1 15-Apr-2016
1 14-Apr-2016
1 15-Apr-2015
2 15-Apr-2014
2 15-Apr-2016
3 15-Apr-2016
END DATA.

SORT CASES BY PatientID DateOfDataEntry.
SELECT IF ($CASENUM EQ 1) or (PatientID NE LAG(PatientID)).
LIST.

OUTPUT:
PatientID DateOfDataEntry

1 15-APR-2015
2 15-APR-2014
3 15-APR-2016

HTH.
a***@berkeley.edu
2020-03-10 18:05:09 UTC
Permalink
Hi, I am having a similar issue I think. I would like to remove cases where there is not a Matched ID for both pre and post. I have respondents who have a unique ID. If a respondent has completed a pre and a post, there would be two lines, both having the same IDCODE and then another variable indicating pre or post. (pre=1 and post=2). How would I remove those who don't have a pre and post and matched ID? Thank you! New to SPSS.
Rich Ulrich
2020-03-10 20:54:19 UTC
Permalink
Post by a***@berkeley.edu
Hi, I am having a similar issue I think. I would like to remove cases where there is not a Matched ID for both pre and post. I have respondents who have a unique ID. If a respondent has completed a pre and a post, there would be two lines, both having the same IDCODE and then another variable indicating pre or post. (pre=1 and post=2). How would I remove those who don't have a pre and post and matched ID? Thank you! New to SPSS.
Do you know that no cases have more than one Pre
or Post record? - duplicated, or otherwise?

For a relatively well-composed file, the simplest
cleaning that I think of is to use AGGREGATE on ID,
ADDing a new variable to each line that has the
Number of cases for the ID; then SELECT to keep
only those with exactly 2 records. Before Selecting,
you can do a Freq to find out whether it is true that
all IDs have either 1 or 2 records, and never more than that.

For a messier file, without warning about the messiness,
you can use the LEAD( ) function as I will describe. LEAD(PrePost)
will return the value of PrePost in the record that comes next,
in contrast to the LAG( ) function that returns the value of
a variable from the previous record.

COMMENT file is sorted by ID and PrePost.
COMMENT - find a pair that make up two proper lines. Save both.
COMPUTE ToUse= 0.
IF (ID eq LEAD(ID) ) and (PrePost=1) and LEAD(PrePost=2) ToUse=1.
IF (ID eq LAG(ID) ) and (PrePost=2) and LAG(PrePost=1) ToUse= 1.
SELECT IF ToUse=1.
Save Outfile blah blah blah.

If you are using a really old version of SPSS, LEAD( ) might not
be available. In that case, you could use the LAG( ) line as above,
then re-Sort the file so that it is Descending order on PrePost,
then use the LEAD( ) line with each LEAD re-written as LAG( ).
Then do the Select.

Totally untested.
--
Rich Ulrich
Bruce Weaver
2020-03-10 22:34:18 UTC
Permalink
Post by a***@berkeley.edu
Hi, I am having a similar issue I think. I would like to remove cases where there is not a Matched ID for both pre and post. I have respondents who have a unique ID. If a respondent has completed a pre and a post, there would be two lines, both having the same IDCODE and then another variable indicating pre or post. (pre=1 and post=2). How would I remove those who don't have a pre and post and matched ID? Thank you! New to SPSS.
You didn't show some sample data, so it's difficult to know what the possibilities are. E.g., might someone have more than 2 lines? Here's a fairly general approach that should work under most circumstances. You'll have to change variable names as needed.

* Generate some data to illustrate.
DATA LIST list / ID Time (2F1).
BEGIN DATA
1 1
1 2
2 1
3 2
4 1
4 1
5 2
5 2
6 1
6 2
7 1
7 1
7 2
END DATA.

AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=ID
/Time_min=MIN(Time)
/Time_max=MAX(Time)
/NumRecs=NU.

COMPUTE ToUse = (NumRecs EQ 2) AND (Time_min EQ 1) and (Time_max EQ 2).
FORMATS ToUse(F1).
LIST ID Time ToUse.

SELECT IF ToUse.
LIST ID Time ToUse.


OUTPUT from first LIST command:

ID Time ToUse

1 1 1
1 2 1
2 1 0
3 2 0
4 1 0
4 1 0
5 2 0
5 2 0
6 1 1
6 2 1
7 1 0
7 1 0
7 2 0

Number of cases read: 13 Number of cases listed: 13

Output from second LIST command:

ID Time ToUse

1 1 1
1 2 1
6 1 1
6 2 1

Number of cases read: 4 Number of cases listed: 4

Loading...