Around midnight yesterday, while I was enjoying The X-Files in the living room, a phone call interrupted me. It was a friend who is struggling with some data management for her dissertation. She has character variables that look like:



A/A

C/C

G/G

T/T



She needs to create two new variables from each old one: the first new variable contains the first character of the old variable, and the second contains the last character. Both should also be converted to numeric variables, with A coded as 1, C as 2, G as 3, and T as 4.



Her first request is very easy: we can use the SUBSTR function to extract the character at a specific position. The two new variables should be named after the old one, for example:



old variable = rs1234567

1st new variable = rs1234567a

2nd new variable = rs1234567b
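
For a single variable, a rough sketch of one way to do it (the variable name rs1234567 is just the example above, and I assume every value looks exactly like A/A) would be:

data geno;
input rs1234567 $;
rs1234567a = input(translate(substr(rs1234567,1,1),'1234','ACGT'), 1.); /*first character -> 1,2,3,4*/
rs1234567b = input(translate(substr(rs1234567,3,1),'1234','ACGT'), 1.); /*last character -> 1,2,3,4*/
cards;
A/A
C/C
G/G
T/T
;
run;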



That is not difficult either, because we can simply define the new names in the DATA step. However, she said, "I have 3,000 variables."



That is a very, very big dataset. If she defined the new variables by hand in SAS, she would have to write out 6,000 new variable names. She thought a SAS macro could probably save her time, but that is still not efficient. Finally, I found a good way to finish her job.



Step 1: Export her dataset to Excel; the first row of the Excel file lists all the variable names.

Step 2: Cut those 3,000 variable names out and build a new dataset in which 3,000 temporary character variables hold those names.

Step 3: Put all 3,000 temporary variables into an array named "old", and create 6,000 new variables in two more arrays named "newa" and "newb".

Step 4: Use a DO...END loop to concatenate each of the 3,000 old names with "a" and "b", and put the results into arrays newa and newb, respectively.

Step 5: Export this dataset to Excel, so all 6,000 new variable names are created.



The following program is a simplified example with only 5 old variables:



data test;
input (old1-old5) ($);
array old[5] $ old1-old5;
array newa[5] $12 newa1-newa5; /*make the new variables long enough to hold the suffix*/
array newb[5] $12 newb1-newb5;
do i=1 to 5;
newa[i]=trim(old[i])||"a"; /*TRIM removes trailing blanks so there is no gap before the "a"*/
newb[i]=trim(old[i])||"b";
end;
keep newa1-newa5 newb1-newb5;
cards;
rs12 rs34 rs56 rs78 rs90
;
run;

proc print; run;

proc export data=test outfile="C:\Temp\newtest.xls" dbms=excel replace;
run;
quit;












-----

This is a new category on my blog for practical issues in using SAS. Even though I have used SAS for many years, I still can't say I understand it completely. The depth of SAS programming goes far beyond what we can imagine. SAS is unquestionably powerful, but its complexity also troubles many users. Taiwanese students at UNC-CH often ask me SAS questions, and I'm glad to help them because they also give me the chance to face challenges I hadn't encountered before. I sincerely share my solutions on this blog. It is a kind of record, but I would rather regard it as the accumulated experience of using SAS over my life.












-----

I have been struggling with Marianne's case for many weeks. I taught her how to use a likelihood ratio test (LRT) to decide whether her reduced model is adequate, but the result always rejects the null hypothesis. That doesn't make sense, because all of the remaining independent variables are significant!



I re-ran her program myself and got a good result, totally different from Marianne's, that does not reject the null hypothesis. I asked her to re-run her SAS programs, but she still got results different from mine.



After comparing our programs, I finally figured out what happened: I keep all of the discrete independent variables in the CLASS statement even when I don't use them in the MODEL statement, whereas Marianne deleted them from both the CLASS and MODEL statements, like this:



/*Marianne's program*/

proc genmod data=mb1.total;

class urban hospcode;

model psatisfa=urban workcomp asupserv / dist=nor link=identity type3;

repeated subject=hospcode/type=exch;

run;



/*My program*/

proc genmod data=mb1.total;

class urban hospcode bedsize netwrk totalmagnet; /* <== here is the difference! */

model psatisfa=urban workcomp asupserv / dist=nor link=identity type3;

repeated subject=hospcode/type=exch;

run;



I discussed this with my supervisor, Mark, and he said both programs are correct. Why are the results different? Because if we put variables in the CLASS statement, SAS deletes every observation with a missing value in any of those variables, whether or not the variables appear in the MODEL statement! On the other hand, if we do not keep the non-significant variables in the CLASS statement, SAS uses all of the data to fit the model.



Therefore, for the likelihood ratio test we need to drop the observations with missing values and use the remaining complete cases to evaluate the likelihoods. After deciding that the reduced model is better than the full model, we then use the whole dataset to fit the final model, just as Marianne's program does!
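
If we want both programs to be compared on exactly the same observations no matter what is listed in the CLASS statement, one option (a rough sketch that reuses the variable names above; "mb1.complete" is a hypothetical data set name) is to build a complete-case data set first and fit both models to it:

data mb1.complete;
set mb1.total;
if cmiss(urban, workcomp, asupserv, bedsize, netwrk, totalmagnet) = 0; /*keep only rows with no missing values in these variables*/
run;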
















-----

I met with Marianne, a postdoc in the School of Nursing at UNC-CH, to discuss some remaining problems in her publication. Those problems confused me as well, and my supervisor Mark didn't have time to join our meeting last Friday, so I stopped by Mark's office again today (Monday). The meeting with Mark produced some useful conclusions.



The first question is whether a likelihood ratio test (LRT) is suitable for comparing two models that contain the same independent variable coded in different ways. For example, one model treats the variable as continuous; what about treating it as categorical in the same model? That is the key point of the question. The only confusing part is that the LRT applies only to nested models, and I was not sure whether this situation qualifies. Mark told me it does, because we can regard the continuous coding as the reduced model and the categorical coding as the full model, since the categorical version has more parameters. Under that view, we can use the LRT as usual.
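
As a concrete sketch (the data set name and variable names here are hypothetical), the two codings can be fit separately and the test computed by hand:

proc genmod data=mydata; /*reduced model: x treated as continuous*/
model y = x / dist=normal;
run;

proc genmod data=mydata; /*full model: x treated as categorical*/
class x;
model y = x / dist=normal;
run;

/*LRT statistic = 2*(logL_full - logL_reduced), with df equal to the difference in the number of parameters, compared to a chi-square distribution*/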



The second question is more complicated. Marianne had already finished the model selection and just needed an LRT to confirm that her final models are the best ones. With the LRT, the decision should be non-significant with a large p-value, so that we do not reject the null hypothesis, which corresponds to the reduced (final) model. However, the result was the opposite: it was significant. After checking the original SAS code, there was no problem there either. Mark pointed out that, based on Marianne's study design, she needs to keep two important variables in the model whether they are significant or not. After including those two variables, the conflict disappeared. But I still wondered whether one of them largely overlaps with the other, because both are geographic variables and are very similar. I dropped the less important one and fit the model again, and the result looked better. The lesson from this problem is that we need to know more about the variables before fitting models; that reduces this kind of confusion.



The two solutions have already been emailed to Marianne. I hope she finds them useful.












-----

June Cho is a Korean woman who is a postdoc in the School of Nursing at UNC-CH. I handled her dissertation from December 2004 to May 2005, and she graduated smoothly in July 2005. Her husband is a professor in the School of Pharmacy; I guess they have become U.S. citizens. After she graduated, she stayed here as her advisor's postdoc and keeps doing research that extends her dissertation.



She wants to do a two-way ANOVA and compare simple main effects in her current study. It is very easy, but she just needed my confirmation. I constructed a macro for her so she can simply call it to fit all of her models (18 in total). However, simple main effects are normally examined only when the interaction term is significant. I ran only one model, and its interaction term was significant, but I can predict that not all of the models will have significant interactions. Yet simple main effects are the whole purpose of her current research. What should we do when the interaction is not significant?



As usual, I asked my supervisor, Mark. He said that even if the interaction term is not significant, we can still keep it in the GLM model. In that way, the simple main effect results from all 18 models are consistent, because every model includes the interaction term. That makes for a more coherent conclusion in the discussion section.
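
For reference, here is a minimal sketch of one such model (the data set and variable names are hypothetical), keeping the interaction term and requesting simple main effects through the SLICE option:

proc glm data=study;
class a b;
model outcome = a b a*b; /*keep the interaction term even if it is non-significant*/
lsmeans a*b / slice=b; /*simple main effects of a within each level of b*/
run;
quit;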














-----


In some statistical analyses, we would like to test the assumption of normality before starting the analysis. In the univariate case, we all know that a Q-Q plot or a K-S statistic can be used to assess normality. But what about the multivariate normal distribution?



Mardia's statistic is a test for multivariate normality. It is based on functions of multivariate skewness and kurtosis, and Mardia's PK should be less than 3 for the assumption of multivariate normality to be considered met. However, in neither SAS nor SPSS is there an easy statement in any procedure to perform it.



In SAS, we need a macro to calculate Mardia's statistics. SAS Institute released the code on its official website; please check the following link:



http://support.sas.com/ctx/samples/index.jsp?sid=480
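
For readers who want to see what is being computed, here is my own rough PROC IML sketch of the usual Mardia skewness and kurtosis measures (this is not the SAS-supplied macro, and not necessarily the exact "PK" value mentioned above), assuming a data set named mydata that contains only the numeric variables to be tested:

proc iml;
use mydata; read all into x; close mydata;
n = nrow(x); p = ncol(x);
d = x - repeat(x[:,], n, 1); /*center each column at its mean*/
s = (d` * d) / n; /*biased sample covariance matrix*/
g = d * inv(s) * d`; /*matrix of Mahalanobis cross-products*/
b1p = sum(g##3) / (n*n); /*Mardia's multivariate skewness*/
b2p = sum(vecdiag(g)##2) / n; /*Mardia's multivariate kurtosis*/
skew_chisq = n * b1p / 6; /*approx. chi-square, df = p*(p+1)*(p+2)/6*/
kurt_z = (b2p - p*(p+2)) / sqrt(8*p*(p+2)/n); /*approx. standard normal*/
print b1p b2p skew_chisq kurt_z;
quit;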



Likewise, in SPSS we need a macro to examine bivariate/multivariate normality. Check this:



http://www.columbia.edu/~ld208/














-----


Lindsey Austin is a master's student (I guess) who works for a professor (I am not sure whether she is a TA). Her professor asked her to analyze some records to assess students' study ability. However, she is not good at statistics, so she sent the data set to me.



The question is very simple: how do we calculate the correlation between individual scores and GPA in reading, math, science, and fundamentals for some courses? The individual score variable is on a 0-100 scale, but the GPA is ordinal (A+, A, A-, ..., F).



In correlation analysis, there are three coefficients we use most often: Pearson, Kendall's tau, and Spearman. However, none of them is designed specifically for the "scale vs. ordinal" case.



I wondered whether there is some special correlation coefficient I don't know about. I checked the SAS documentation for PROC CORR, but there is no special coefficient there. My supervisor, Mark, even pulled out his old handouts (he also graduated from the biostatistics department at UNC-CH) to look for anything relevant, but found nothing either.



Finally, we concluded that we can rank the individual score variable and then use the Spearman correlation.
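
As a hypothetical sketch of that idea (the data set and variable names are my own placeholders), we can map the letter grades to ordered numeric codes and ask PROC CORR for the Spearman coefficient, which works on the ranks of both variables:

proc format;
invalue gpa_in 'A+'=13 'A'=12 'A-'=11 'B+'=10 'B'=9 'B-'=8 'C+'=7 'C'=6 'C-'=5 'D+'=4 'D'=3 'D-'=2 'F'=1;
run;

data ranked;
set grades; /*"grades", "gpa", and "score" are assumed names*/
gpa_num = input(gpa, gpa_in.); /*letter grade -> ordered numeric code*/
run;

proc corr data=ranked spearman;
var score gpa_num;
run;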



This is a pretty special case. I think there should be a specific correlation coefficient for this situation, but we haven't figured it out yet. If I find one, I will post it here.












-----

I have created a new category to hold some articles about my statistical consulting experiences.



It is not only a memory but also a record I can look back on if I run into similar cases or problems in the future.



I will write these articles in my office and type them in English. That doesn't mean I am showing off my English; there is simply no Chinese input software on the computer in my office.



Please don't be picky about my grammar and spelling; I know my English is poor. If you have any suggestions, you are welcome to leave a message on the relevant articles.












-----

Marianne is a postdoc in the School of Nursing at UNC-CH who now lives in Virginia. Her study compares the nurse work environment and quality of care between urban and rural hospitals. I have been consulting on this case since last December.




The model she fits is a hierarchical regression model estimated with generalized estimating equations (GEE). The GEE method is very popular for model fitting when the observations are correlated (for example, clustered or repeated measures) and there are also many missing data in the data set. It was developed by Kung-Yi Liang, a Taiwanese professor in the Department of Biostatistics at Johns Hopkins University.




Today, Marianne sent an email showing an error message in her SAS output: the Hessian matrix is only positive semidefinite, and the program terminates. I had encountered this situation before, but until now I did not know how to solve it. My supervisor, Mark, told me it appears when there are empty cells in the cross-tabulation of some categorical independent variables with the categorical dependent variable. We can use PROC FREQ in SAS to check which IV has empty cells and then remove it.
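
As a minimal sketch of that check (the variable names below are just placeholders borrowed from the earlier programs), we can cross-tabulate the categorical variables and look for zero-count cells:

proc freq data=mb1.total;
tables urban*bedsize / missing; /*look for cells with a zero count*/
run;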




As Mark said, there was an empty cell involving one IV. After deleting that variable, the output appeared without any error message.




It was a great experience in model fitting. Thanks, Mark!












-----



Since the School of Nursing took me in last year, I never imagined I would get any tuition remission. My boss Michael could not give me any guarantee at the beginning, so I only hoped to receive my salary on time every month (which, admittedly, was never really in doubt; the university is unlikely to delay paychecks).



At the end of last semester, the department secretary posted a notice asking everyone to submit their pay stubs to the department for review. I asked Michael for a copy and, without waiting for news from the School of Nursing, sent it to the department first to see whether there was any chance, because the dean had said that if the School of Nursing did not pay, the department would help cover part of it. But when I handed my pay stub to Malissa, she surprisingly said the School of Nursing was willing to pay half of my tuition.



Hearing this news right before Christmas was probably the best Christmas gift! I then confidently deferred the tuition bill and waited for a new bill before paying. I waited and waited, for more than three months, without seeing any sign of a bill. Last week LinLin came over and told me to hurry and check my bill online, because more than six thousand dollars (about two thirds) had already been deducted from her tuition, which was a bit different from what Malissa had told me. So I nervously checked online, and indeed two thirds had been deducted! But that was not the end: LinLin said another amount would still be deducted, leaving only the student affairs fee, about six hundred dollars. I am a person who is easily satisfied; at that moment I thought two thirds was already good enough and did not expect any further remission. Then a few days ago I checked online again, and sure enough another amount had been deducted, leaving only a bit over six hundred dollars to pay. This semester's tuition was waived just like that, almost inexplicably. Michael is truly like a second parent to me; if he had not taken me in, I would still be painfully paying that nine thousand dollars of tuition!



To commemorate this moving moment, I am pasting the record from the website here.



=====================================================================================

This reflects new charges and payments that will appear on your next bill.

ACOUNT STATEMENT

DATE SUBCODE DESCRIPTION DEBITS CREDITS BALANCE

PREVIOUS BALANCE 9001.51

02/14/05 69220 TUIT REMISSION PH GRAD - 6245.50 2756.01

02/28/05 90601 PAYMENT - ISA - 2139.50 616.51

=====================================================================================















-----