Yesterday midnight, when I enjoyed watching X-files in the living room, a call disturbed me. That is my friend who is in the trouble of doing some data management in her dissertation. She has some character variables which looks like:



A/A

C/C

G/G

T/T



She needs to create two new variables. The first new variable contains the first word in old variable, and the second variable contains the last word in old variable. Both of them should also be converted into numeric variables. The criteria is A for 1, C for 2, G for 3, and T for 4.



Her first request is very easy, we can use a special function named "SUBSTR" to extract some word in specific location. The names of two new variables should be names as, for example:



old variable = rs1234567

1st new variable = rs1234567a

2nd new variable = rs1234567b



That's not difficult as well because we can just define the new names in data procedure. However, she said, "I have 3000 variables."



That's is a very very big dataset. If she define new variables by herself in SAS, she needs to write 6000 new variables. She thought a SAS macro could probably save her time, but it is still not efficient. Finally, I found out a good way to finish her job.



Step 1: Export her dataset into Excel, and the first line of the Excel file lists all variable names.

Step 2: Cut down those 3000 variables, and create a new dataset with 3000 temporary variables containing those 3000 variables.

Step 3: Put all 3000 temporary variables in an array named "old", and create 6000 new variables in another two array named "newa" and "newb".

Step 4: Use "Do...End" to concatenate 3000 old variables with "a" and "b", then put them into array newa and newb, respectively.

Step 5: Export this dataset into Excel, so all 6000 new variables are created.



The following program is a simplified example with only 5 old variables:



data test;

input (old1-old5) ($) ;

array old[5] $ old1-old5;

array newa[5] $12 newa1-newa5; /*adjust the lenght of new variables longer*/

array newb[5] $12 newb1-newb5;

do i=1 to 5;

newa[i]=right(old[i])||"a"; /*keep old variable right to prevent from a gap between old variable and a*/

newb[i]=right(old[i])||"b";

end;

keep newa1-newa5 newb1-newb5;

cards;

rs12 rs34 rs56 rs78 rs90

run;

proc print;run;

proc export data=test outfile="C:\Temp\newtest.xls"; run;

quit;












創作者介紹

ToTo 奇妙の冒險

cchien 發表在 痞客邦 PIXNET 留言(1) 人氣()


留言列表 (1)

禁止留言
  • cchien
  • COMMENT:
    I suddenly think about a new problem. Because the number of

    columns in Excel is no more than 1000 (I forgot the exact

    number), a dataset with 3000 variables will be cut when

    converting into Excel. A possible solution is transporting this

    dataset by PROC TRANSPORT, then the first column of new dataset

    contains all variables. Under this situation, this program can

    be much simply organized by the following codes:





    proc transpose data=test2 out=test3 (keep=_name_);

    run;



    data test3;

    set test3;

    newa=right(_name_)||"a";

    newb=right(_name_)||"b";

    run;



    ods rtf file="c:\temp\test4.rtf";

    proc transpose data=test3 out=test4 (drop=_name_);

    var newa newb;

    run;

    proc print; run;

    ods rtf close;



    The final dataset will be exported as a rtf file, and it can be

    opened by Word. Because there is no limitation of length in Word,

    we can just copy/paste many times to put all new variables in

    new program.
    -----
找更多相關文章與討論