Yesterday midnight, when I enjoyed watching X-files in the living room, a call disturbed me. That is my friend who is in the trouble of doing some data management in her dissertation. She has some character variables which looks like:



A/A

C/C

G/G

T/T



She needs to create two new variables. The first new variable contains the first word in old variable, and the second variable contains the last word in old variable. Both of them should also be converted into numeric variables. The criteria is A for 1, C for 2, G for 3, and T for 4.



Her first request is very easy, we can use a special function named "SUBSTR" to extract some word in specific location. The names of two new variables should be names as, for example:



old variable = rs1234567

1st new variable = rs1234567a

2nd new variable = rs1234567b



That's not difficult as well because we can just define the new names in data procedure. However, she said, "I have 3000 variables."



That's is a very very big dataset. If she define new variables by herself in SAS, she needs to write 6000 new variables. She thought a SAS macro could probably save her time, but it is still not efficient. Finally, I found out a good way to finish her job.



Step 1: Export her dataset into Excel, and the first line of the Excel file lists all variable names.

Step 2: Cut down those 3000 variables, and create a new dataset with 3000 temporary variables containing those 3000 variables.

Step 3: Put all 3000 temporary variables in an array named "old", and create 6000 new variables in another two array named "newa" and "newb".

Step 4: Use "Do...End" to concatenate 3000 old variables with "a" and "b", then put them into array newa and newb, respectively.

Step 5: Export this dataset into Excel, so all 6000 new variables are created.



The following program is a simplified example with only 5 old variables:



data test;

input (old1-old5) ($) ;

array old[5] $ old1-old5;

array newa[5] $12 newa1-newa5; /*adjust the lenght of new variables longer*/

array newb[5] $12 newb1-newb5;

do i=1 to 5;

newa[i]=right(old[i])||"a"; /*keep old variable right to prevent from a gap between old variable and a*/

newb[i]=right(old[i])||"b";

end;

keep newa1-newa5 newb1-newb5;

cards;

rs12 rs34 rs56 rs78 rs90

run;

proc print;run;

proc export data=test outfile="C:\Temp\newtest.xls"; run;

quit;












arrow
arrow
    全站熱搜

    cchien 發表在 痞客邦 留言(1) 人氣()