grouping and selecting cases?

Discussion:

(too old to reply)

Elyse Cottrell-Martin

2019-06-05 13:58:35 UTC

I have stacked data that I want to do a few things with.

For most people they have 3 sets of data (before, right after, and a follow up for an event) but some do not.

example:

ID# Date Score Time
1 01/02/2015 1.2 Before
1 01/03/2015 1.5 After
1 02/05/2015 1.3 Followup
2 01/06/2015 2.3 Before
2 01/08/2015 1.4 After
3 01/08/2015 1.5 Before
3 02/05/2015 0.2 After
3 02/10/2015 1.1 Followup
4 03/20/2015 0.3 Followup
4 03/25/2015 1.2 Followup

I only want to include people who have a before, after, AND followup data point (so in the above I would want to exclude ID#2 and ID#4 because they only have two (some people only have 1, and some people have 3 but they're not the right time code, for example they have an After and 2 followups). I have over 3000 lines of data though so I need syntax or how to do this using the GUI.

I think this is a two step process? Grouping IDs and then selecting cases? I know how to group data using aggregate, and I know how to select cases, but I can't seem to figure out how to combine these things.

Help is greatly appreciated, thank you.

Rich Ulrich

2019-06-06 04:55:00 UTC

Permalink

On Wed, 5 Jun 2019 06:58:35 -0700 (PDT), Elyse Cottrell-Martin

Post by Elyse Cottrell-Martin
I have stacked data that I want to do a few things with.
For most people they have 3 sets of data (before, right after, and a follow up for an event) but some do not.
ID# Date Score Time
1 01/02/2015 1.2 Before
1 01/03/2015 1.5 After
1 02/05/2015 1.3 Followup
2 01/06/2015 2.3 Before
2 01/08/2015 1.4 After
3 01/08/2015 1.5 Before
3 02/05/2015 0.2 After
3 02/10/2015 1.1 Followup
4 03/20/2015 0.3 Followup
4 03/25/2015 1.2 Followup
I only want to include people who have a before, after, AND followup data point (so in the above I would want to exclude ID#2 and ID#4 because they only have two (some people only have 1, and some people have 3 but they're not the right time code, for example they have an After and 2 followups). I have over 3000 lines of data though so I need syntax or how to do this using the GUI.
I think this is a two step process? Grouping IDs and then selecting cases? I know how to group data using aggregate, and I know how to select cases, but I can't seem to figure out how to combine these things.
Help is greatly appreciated, thank you.

I might do this as two steps:
Save to a file the ID#'s that are complete;
Do a /TABLE= <orig> match to pick up those cases.

The new file will have extra Followup, if those occurs.
You can get rid of them with LAG if you want.

This is all untested. It assumes that there is never more than
one "After" for an ID#, and that the file is sorted in logical date
order (Before, after, followup). And that the capitalization is
always correct.

DO IF (time eq "Followup") and
+ (lag(ID#) = ID# ) and (lag( ID#,2) = ID#).

DO IF (lag(Time) eq "After") and (lag(Time,2) eq "Before").
SAVE outfile= <IDfile>/vars= ID#.
END IF.
END IF.

MATCH FILES file= <IDfile name>/table= * / by= ID#/ vars= all.

--
Rich Ulrich

Bruce Weaver

2019-06-06 15:09:43 UTC

Permalink

Post by Rich Ulrich
On Wed, 5 Jun 2019 06:58:35 -0700 (PDT), Elyse Cottrell-Martin

Save to a file the ID#'s that are complete;
Do a /TABLE= <orig> match to pick up those cases.
The new file will have extra Followup, if those occurs.
You can get rid of them with LAG if you want.
This is all untested. It assumes that there is never more than
one "After" for an ID#, and that the file is sorted in logical date
order (Before, after, followup). And that the capitalization is
always correct.
DO IF (time eq "Followup") and
+ (lag(ID#) = ID# ) and (lag( ID#,2) = ID#).
DO IF (lag(Time) eq "After") and (lag(Time,2) eq "Before").
SAVE outfile= <IDfile>/vars= ID#.
END IF.
END IF.
MATCH FILES file= <IDfile name>/table= * / by= ID#/ vars= all.
--
Rich Ulrich

I cannot currently see the original post. But here is what comes to mind for me.

NEW FILE.
DATASET CLOSE ALL.
DATA LIST LIST / ID (F2.0) Date(ADATE) Score (F5.1) Time(A8).
BEGIN DATA
1 01/02/2015 1.2 Before
1 01/03/2015 1.5 After
1 02/05/2015 1.3 Followup
2 01/06/2015 2.3 Before
2 01/08/2015 1.4 After
3 01/08/2015 1.5 Before
3 02/05/2015 0.2 After
3 02/10/2015 1.1 Followup
4 03/20/2015 0.3 Followup
4 03/25/2015 1.2 Followup
END DATA.

* Compute 0/1 indicator variables for the 3 time points.
COMPUTE t1 = Time EQ "Before".
COMPUTE t2 = Time EQ "After".
COMPUTE t3 = Time EQ "Followup".
FORMATS t1 to t3 (F1).

* Write max values of indicator to each row per ID.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
/BREAK=ID
/t1 t2 t3 = MAX(t1,t2,t3).

* Keep records where sum of the indicators = 3.
SELECT IF SUM(t1 to t3) EQ 3.
LIST.
DELETE VARIABLES t1 to t3.

Output from LIST:

ID Date Score Time t1 t2 t3

1 01/02/2015 1.2 Before 1 1 1
1 01/03/2015 1.5 After 1 1 1
1 02/05/2015 1.3 Followup 1 1 1
3 01/08/2015 1.5 Before 1 1 1
3 02/05/2015 .2 After 1 1 1
3 02/10/2015 1.1 Followup 1 1 1

Number of cases read: 6 Number of cases listed: 6

2019-06-06 18:02:12 UTC

Permalink

Post by Bruce Weaver

Post by Rich Ulrich
On Wed, 5 Jun 2019 06:58:35 -0700 (PDT), Elyse Cottrell-Martin

I cannot currently see the original post. But here is what comes to mind for me.
NEW FILE.
DATASET CLOSE ALL.
DATA LIST LIST / ID (F2.0) Date(ADATE) Score (F5.1) Time(A8).
BEGIN DATA
1 01/02/2015 1.2 Before
1 01/03/2015 1.5 After
1 02/05/2015 1.3 Followup
2 01/06/2015 2.3 Before
2 01/08/2015 1.4 After
3 01/08/2015 1.5 Before
3 02/05/2015 0.2 After
3 02/10/2015 1.1 Followup
4 03/20/2015 0.3 Followup
4 03/25/2015 1.2 Followup
END DATA.
* Compute 0/1 indicator variables for the 3 time points.
COMPUTE t1 = Time EQ "Before".
COMPUTE t2 = Time EQ "After".
COMPUTE t3 = Time EQ "Followup".
FORMATS t1 to t3 (F1).
* Write max values of indicator to each row per ID.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
/BREAK=ID
/t1 t2 t3 = MAX(t1,t2,t3).
* Keep records where sum of the indicators = 3.
SELECT IF SUM(t1 to t3) EQ 3.
LIST.
DELETE VARIABLES t1 to t3.
ID Date Score Time t1 t2 t3
1 01/02/2015 1.2 Before 1 1 1
1 01/03/2015 1.5 After 1 1 1
1 02/05/2015 1.3 Followup 1 1 1
3 01/08/2015 1.5 Before 1 1 1
3 02/05/2015 .2 After 1 1 1
3 02/10/2015 1.1 Followup 1 1 1
Number of cases read: 6 Number of cases listed: 6

This code will also work.
Ki

*****************************************************************************
*This is from short form to long to filter only those with 3 points of data.
*****************************************************************************.

CASESTOVARS
/ID=id
/GROUPBY=VARIABLE.

select if not missing(score.1) and not missing(score.2) and not missing(score.3).
execute.

VARSTOCASES
/MAKE date FROM date.1 TO date.3
/MAKE score FROM score.1 TO score.3
/MAKE time FROM time.1 TO time.3
/KEEP id .

Rich Ulrich

2019-06-07 02:19:26 UTC

Permalink

Post by Ki

Post by Bruce Weaver

Post by Rich Ulrich
On Wed, 5 Jun 2019 06:58:35 -0700 (PDT), Elyse Cottrell-Martin

I cannot currently see the original post. But here is what comes to mind for me.
NEW FILE.
DATASET CLOSE ALL.
DATA LIST LIST / ID (F2.0) Date(ADATE) Score (F5.1) Time(A8).
BEGIN DATA
1 01/02/2015 1.2 Before
1 01/03/2015 1.5 After
1 02/05/2015 1.3 Followup
2 01/06/2015 2.3 Before
2 01/08/2015 1.4 After
3 01/08/2015 1.5 Before
3 02/05/2015 0.2 After
3 02/10/2015 1.1 Followup
4 03/20/2015 0.3 Followup
4 03/25/2015 1.2 Followup
END DATA.
* Compute 0/1 indicator variables for the 3 time points.
COMPUTE t1 = Time EQ "Before".
COMPUTE t2 = Time EQ "After".
COMPUTE t3 = Time EQ "Followup".
FORMATS t1 to t3 (F1).
* Write max values of indicator to each row per ID.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
/BREAK=ID
/t1 t2 t3 = MAX(t1,t2,t3).
* Keep records where sum of the indicators = 3.
SELECT IF SUM(t1 to t3) EQ 3.
LIST.
DELETE VARIABLES t1 to t3.
ID Date Score Time t1 t2 t3
1 01/02/2015 1.2 Before 1 1 1
1 01/03/2015 1.5 After 1 1 1
1 02/05/2015 1.3 Followup 1 1 1
3 01/08/2015 1.5 Before 1 1 1
3 02/05/2015 .2 After 1 1 1
3 02/10/2015 1.1 Followup 1 1 1
Number of cases read: 6 Number of cases listed: 6

This code will also work.
Ki
*****************************************************************************
*This is from short form to long to filter only those with 3 points of data.
*****************************************************************************.
CASESTOVARS
/ID=id
/GROUPBY=VARIABLE.
select if not missing(score.1) and not missing(score.2) and not missing(score.3).
execute.
VARSTOCASES
/MAKE date FROM date.1 TO date.3
/MAKE score FROM score.1 TO score.3
/MAKE time FROM time.1 TO time.3
/KEEP id .

My solution keeps only the first Followup when the case is good
but has extra Followups. An obvious choice, but not the only one.

Bruce's solution keeps all Followups, which could be handy in
order to make alternate choices later on, but it is an odd file
organization -- it sort of requires selection, before anything
further is done.

I don't know what CASESTOVARS does when it has a duplicated
Followup (different date). I suspect it might toss out an error
or warning about file order, or else use the last Followup.

The OP gives no instructions/preference for that occurrence.

--
Rich Ulrich

2019-06-07 18:04:48 UTC

Permalink

Post by Rich Ulrich

Post by Ki

Post by Bruce Weaver

Post by Rich Ulrich
On Wed, 5 Jun 2019 06:58:35 -0700 (PDT), Elyse Cottrell-Martin

I cannot currently see the original post. But here is what comes to mind for me.
NEW FILE.
DATASET CLOSE ALL.
DATA LIST LIST / ID (F2.0) Date(ADATE) Score (F5.1) Time(A8).
BEGIN DATA
1 01/02/2015 1.2 Before
1 01/03/2015 1.5 After
1 02/05/2015 1.3 Followup
2 01/06/2015 2.3 Before
2 01/08/2015 1.4 After
3 01/08/2015 1.5 Before
3 02/05/2015 0.2 After
3 02/10/2015 1.1 Followup
4 03/20/2015 0.3 Followup
4 03/25/2015 1.2 Followup
END DATA.
* Compute 0/1 indicator variables for the 3 time points.
COMPUTE t1 = Time EQ "Before".
COMPUTE t2 = Time EQ "After".
COMPUTE t3 = Time EQ "Followup".
FORMATS t1 to t3 (F1).
* Write max values of indicator to each row per ID.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
/BREAK=ID
/t1 t2 t3 = MAX(t1,t2,t3).
* Keep records where sum of the indicators = 3.
SELECT IF SUM(t1 to t3) EQ 3.
LIST.
DELETE VARIABLES t1 to t3.
ID Date Score Time t1 t2 t3
1 01/02/2015 1.2 Before 1 1 1
1 01/03/2015 1.5 After 1 1 1
1 02/05/2015 1.3 Followup 1 1 1
3 01/08/2015 1.5 Before 1 1 1
3 02/05/2015 .2 After 1 1 1
3 02/10/2015 1.1 Followup 1 1 1
Number of cases read: 6 Number of cases listed: 6

This code will also work.
Ki
*****************************************************************************
*This is from short form to long to filter only those with 3 points of data.
*****************************************************************************.
CASESTOVARS
/ID=id
/GROUPBY=VARIABLE.
select if not missing(score.1) and not missing(score.2) and not missing(score.3).
execute.
VARSTOCASES
/MAKE date FROM date.1 TO date.3
/MAKE score FROM score.1 TO score.3
/MAKE time FROM time.1 TO time.3
/KEEP id .

My solution keeps only the first Followup when the case is good
but has extra Followups. An obvious choice, but not the only one.
Bruce's solution keeps all Followups, which could be handy in
order to make alternate choices later on, but it is an odd file
organization -- it sort of requires selection, before anything
further is done.
I don't know what CASESTOVARS does when it has a duplicated
Followup (different date). I suspect it might toss out an error
or warning about file order, or else use the last Followup.
The OP gives no instructions/preference for that occurrence.
--
Rich Ulrich

Hi Rich,

I think the code will still work for more than 3 assessments but really depends on the criteria to select a valid case. CASESTOVARS basically will create an "n" set of variables for "n" encounters or assessments.

It will definitely need an update in the syntax if the data structure or selection criteria changes in the second part of the code after CASESTOVARS.

Ki