Selected Category: SAS programming (14)

View Mode: Post List Post Summary

 昨日在幫 Ann 完成 GEE model diagnostic 之後,她雖很心滿意足地回去繼續她的論文,但留了一個洞給我繼續挖。任何的 model diagnostic 中,每個觀測值都會產生一些統計值,如 Cook's D 或 Leverage,都需要畫一些圖表來顯示每個統計值的高低,讓讀者可以很明白地看出哪些值是 influential data。因此,除了要描點之外,還要特別把幾個 influential data 的編號標示出來。這種圖示我就不知道該怎樣去做了。因此,趁著寫感謝信給 Dr. John Pressier 之餘,順便在信中問他那種圖該怎樣畫(因為他在 paper 裡面秀了很多張這類的圖表)。但如我一開始的直覺假設,他那些圖果然是用 S-PLUS 弄出來的。這時要我去研究怎樣用 S-PLUS 去做圖是不可能的,因為我從上個世紀(也就是我念研究所的時代)就不喜歡用 S-PLUS,即便他後來的功能改的很強。因此,正當我在翻閱 SAS/GRAPH 手冊尋求任何配套程式寫法時,John 的另一位 co-author,Dr. Bradley Hammill,寄了一封信給我,教我如何用 SAS 來畫那種圖。原來只要用 POINTLABEL 這個指令就可以標示想要的數字出來(但 SAS/GRAPH 原文手冊裡面居然沒有列出這個指令!)。在此分享一下 Brad 提供的範例:



* Generate some random data for plotting;


data a;


    do idnum = 1 to 100;


        sres = rannor(23456);


        output;


    end;


run;





* Label all points in index plot with ID number;


symbol v=dot c=black pointlabel=(h=2pct "#idnum");





proc gplot data=a;


    plot sres * idnum;


run;





* Setup a new var that only has the ID number if abs(SRES) > 2;


data a;


    set a;


    if abs(sres) > 2 then idhigh = idnum;


    else idhigh = .;


run;





* Label only high-SRES points in index plot with ID number;


symbol v=dot c=black pointlabel=(h=2pct "#idhigh");





* W/out this option, missing labels in plot will appear as .;


options missing=" ";





proc gplot data=a;


    plot sres * idnum;


run;














Posted by cchien at 痞客邦 PIXNET Comments(3) Trackback(0) Hits(107)

在問卷調查中,為了避免輸入錯誤,比較有經費的計畫大都會一次雇用兩個人手做double
entry的程序,就是說一份問卷由不同的人各輸入一次,然後比較看這兩個人有沒有輸入錯誤。如果兩人的輸入完全一致,就比較有高度的可靠度來說這筆資料是正確地輸入的。當然也可能會發生兩人同時出錯在同一個地方,但這機率很小。




問題來了,輸入好的資料如何交由電腦來做自動比對。SAS很貼心地有一個PROC
COMPARE的程序可以進行交叉比對。這個工具很有威力的理由是,比對數值變數其實還是小case,更強的地方是可以比對文字變數,而且可以精確到計算出裡面有多少個typo。但其實我們不需要知道太詳盡,只要兩兩比對率沒有到100%,就可以肯定一定其中某個人有錯誤輸入。但SAS沒有辦法自動校正,畢竟連我們也不知道哪個人輸入的才是正確的版本,所以只能將有問題的資料重新調出來用人工修正。




可以在下面這個連結找到完整語法和幾個實用範例:



http://www.sussex.ac.uk/its/help/guides/sas/proc/z0057814.htm



如果是SPSS資料格式,可以用下面程式把SPSS檔案叫進SAS:



proc import datafile=xxx.sav out=xxx dbms=SAV replace;



run;















-----

Posted by cchien at 痞客邦 PIXNET Comments(0) Trackback(0) Hits(118)

Yesterday midnight, when I enjoyed watching X-files in the living room, a call disturbed me. That is my friend who is in the trouble of doing some data management in her dissertation. She has some character variables which looks like:



A/A

C/C

G/G

T/T



She needs to create two new variables. The first new variable contains the first word in old variable, and the second variable contains the last word in old variable. Both of them should also be converted into numeric variables. The criteria is A for 1, C for 2, G for 3, and T for 4.



Her first request is very easy, we can use a special function named "SUBSTR" to extract some word in specific location. The names of two new variables should be names as, for example:



old variable = rs1234567

1st new variable = rs1234567a

2nd new variable = rs1234567b



That's not difficult as well because we can just define the new names in data procedure. However, she said, "I have 3000 variables."



That's is a very very big dataset. If she define new variables by herself in SAS, she needs to write 6000 new variables. She thought a SAS macro could probably save her time, but it is still not efficient. Finally, I found out a good way to finish her job.



Step 1: Export her dataset into Excel, and the first line of the Excel file lists all variable names.

Step 2: Cut down those 3000 variables, and create a new dataset with 3000 temporary variables containing those 3000 variables.

Step 3: Put all 3000 temporary variables in an array named "old", and create 6000 new variables in another two array named "newa" and "newb".

Step 4: Use "Do...End" to concatenate 3000 old variables with "a" and "b", then put them into array newa and newb, respectively.

Step 5: Export this dataset into Excel, so all 6000 new variables are created.



The following program is a simplified example with only 5 old variables:



data test;

input (old1-old5) ($) ;

array old[5] $ old1-old5;

array newa[5] $12 newa1-newa5; /*adjust the lenght of new variables longer*/

array newb[5] $12 newb1-newb5;

do i=1 to 5;

newa[i]=right(old[i])||"a"; /*keep old variable right to prevent from a gap between old variable and a*/

newb[i]=right(old[i])||"b";

end;

keep newa1-newa5 newb1-newb5;

cards;

rs12 rs34 rs56 rs78 rs90

run;

proc print;run;

proc export data=test outfile="C:\Temp\newtest.xls"; run;

quit;












Posted by cchien at 痞客邦 PIXNET Comments(1) Trackback(0) Hits(148)

This is a new category in my blog which includes some practical issues in using SAS. Even though I use SAS for many years, I still can't say I totally understand SAS. The depth of SAS programming is far away from what we can imagine. The powerful of SAS is absolutely, but the complication of it also troubles many users. Many Taiwanese students in UNC-CH often ask some SAS problems. I'd be glade to help them because they also give me the chance to face some challenges that I didn't encounter before. I am sincere to share my solutions in this blog. It is a sort of record, but I rather regard it as a cumulative experience of using SAS in my life.












Posted by cchien at 痞客邦 PIXNET Comments(2) Trackback(0) Hits(31)

«1 2