SPSS equivalent of bysort and

Discussion:

SPSS equivalent of bysort and _n in Stata

(too old to reply)

t***@gmail.com

2016-04-15 13:35:02 UTC

Hi all

I'm new to SPSS and struggling with syntax to drop duplicate cases. The duplicates are not the same in all variables - they have been double entered but on different dates. I want to exclude all but the first record (by date of entry) for each patient. In Stata I would do this using the following syntax:

/* sort by the identifying variables (PatientID for simplicity) and by date of data entry */
sort PatientID DateOfDataEntry

/* drop all duplicate records of PatientID except the first one */
by PatientID: drop if _n > 1

Is there a way of doing this in SPSS?

Thanks in advance for your help.

Tristan

Bruce Weaver

2016-04-15 18:00:23 UTC

Permalink

Post by t***@gmail.com
Hi all
/* sort by the identifying variables (PatientID for simplicity) and by date of data entry */
sort PatientID DateOfDataEntry
/* drop all duplicate records of PatientID except the first one */
by PatientID: drop if _n > 1
Is there a way of doing this in SPSS?
Thanks in advance for your help.
Tristan

I am occasional enough user of Stata that I recognize that _n system variable. Both it and _N look very useful! (People unfamiliar with Stata can read about them here: http://www.ats.ucla.edu/stat/stata/notes/countn.htm.)

Meanwhile, I think something like the following should do the trick for you in SPSS.

* Sort by the identifying variables and by date of data entry.
SORT CASES BY PatientID DateOfDataEntry.

* Drop all duplicate records of PatientID except the first one.
* Use system variable $CASENUM to identify the first case in the file.
SELECT IF ($CASENUM EQ 1) or (PatientID NE LAG(PatientID)).
EXECUTE.

Testing on a small dataset:

DATA LIST list / PatientID (F5.0) DateOfDataEntry (DATE11).
BEGIN DATA
1 15-Apr-2016
1 14-Apr-2016
1 15-Apr-2015
2 15-Apr-2014
2 15-Apr-2016
3 15-Apr-2016
END DATA.

SORT CASES BY PatientID DateOfDataEntry.
SELECT IF ($CASENUM EQ 1) or (PatientID NE LAG(PatientID)).
LIST.

OUTPUT:
PatientID DateOfDataEntry

1 15-APR-2015
2 15-APR-2014
3 15-APR-2016

HTH.

a***@berkeley.edu

2020-03-10 18:05:09 UTC

Permalink

Hi, I am having a similar issue I think. I would like to remove cases where there is not a Matched ID for both pre and post. I have respondents who have a unique ID. If a respondent has completed a pre and a post, there would be two lines, both having the same IDCODE and then another variable indicating pre or post. (pre=1 and post=2). How would I remove those who don't have a pre and post and matched ID? Thank you! New to SPSS.

Rich Ulrich

2020-03-10 20:54:19 UTC

Permalink

Post by a***@berkeley.edu
Hi, I am having a similar issue I think. I would like to remove cases where there is not a Matched ID for both pre and post. I have respondents who have a unique ID. If a respondent has completed a pre and a post, there would be two lines, both having the same IDCODE and then another variable indicating pre or post. (pre=1 and post=2). How would I remove those who don't have a pre and post and matched ID? Thank you! New to SPSS.

Do you know that no cases have more than one Pre
or Post record? - duplicated, or otherwise?

For a relatively well-composed file, the simplest
cleaning that I think of is to use AGGREGATE on ID,
ADDing a new variable to each line that has the
Number of cases for the ID; then SELECT to keep
only those with exactly 2 records. Before Selecting,
you can do a Freq to find out whether it is true that
all IDs have either 1 or 2 records, and never more than that.

For a messier file, without warning about the messiness,
you can use the LEAD( ) function as I will describe. LEAD(PrePost)
will return the value of PrePost in the record that comes next,
in contrast to the LAG( ) function that returns the value of
a variable from the previous record.

COMMENT file is sorted by ID and PrePost.
COMMENT - find a pair that make up two proper lines. Save both.
COMPUTE ToUse= 0.
IF (ID eq LEAD(ID) ) and (PrePost=1) and LEAD(PrePost=2) ToUse=1.
IF (ID eq LAG(ID) ) and (PrePost=2) and LAG(PrePost=1) ToUse= 1.
SELECT IF ToUse=1.
Save Outfile blah blah blah.

If you are using a really old version of SPSS, LEAD( ) might not
be available. In that case, you could use the LAG( ) line as above,
then re-Sort the file so that it is Descending order on PrePost,
then use the LEAD( ) line with each LEAD re-written as LAG( ).
Then do the Select.

Totally untested.

--
Rich Ulrich

Bruce Weaver

2020-03-10 22:34:18 UTC

Permalink

You didn't show some sample data, so it's difficult to know what the possibilities are. E.g., might someone have more than 2 lines? Here's a fairly general approach that should work under most circumstances. You'll have to change variable names as needed.

* Generate some data to illustrate.
DATA LIST list / ID Time (2F1).
BEGIN DATA
1 1
1 2
2 1
3 2
4 1
4 1
5 2
5 2
6 1
6 2
7 1
7 1
7 2
END DATA.

AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=ID
/Time_min=MIN(Time)
/Time_max=MAX(Time)
/NumRecs=NU.

COMPUTE ToUse = (NumRecs EQ 2) AND (Time_min EQ 1) and (Time_max EQ 2).
FORMATS ToUse(F1).
LIST ID Time ToUse.

SELECT IF ToUse.
LIST ID Time ToUse.

OUTPUT from first LIST command:

ID Time ToUse

1 1 1
1 2 1
2 1 0
3 2 0
4 1 0
4 1 0
5 2 0
5 2 0
6 1 1
6 2 1
7 1 0
7 1 0
7 2 0

Number of cases read: 13 Number of cases listed: 13

Output from second LIST command:

ID Time ToUse

1 1 1
1 2 1
6 1 1
6 2 1

Number of cases read: 4 Number of cases listed: 4