Around midnight yesterday, while I was enjoying The X-Files in the living room, a phone call interrupted me. It was my friend, who was struggling with some data management for her dissertation. She has some character variables that look like:

A/A

C/C

G/G

T/T

She needs to create two new variables: the first contains the first word of the old variable, and the second contains the last word. Both should also be converted to numeric, coded A = 1, C = 2, G = 3, and T = 4.
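For one such variable, a minimal sketch of both requests, using SCAN to grab the words (SUBSTR with fixed positions would work equally well for "X/X" values) and INDEX for the numeric coding; the variable names geno, genoa, and genob are hypothetical:

```sas
/* Hypothetical one-variable version: split "A/A"-style values
   and recode A/C/G/T as 1/2/3/4 */
data split;
   input geno $;
   firstw = scan(geno, 1, '/');          /* first word */
   lastw  = scan(geno, 2, '/');          /* last word  */
   genoa  = index('ACGT', trim(firstw)); /* A=1, C=2, G=3, T=4 */
   genob  = index('ACGT', trim(lastw));
cards;
A/A
C/G
T/T
;
run;
```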

Her first request is very easy: we can use the SUBSTR function to extract the characters at a specific location. The two new variables should be named, for example:

old variable = rs1234567

1st new variable = rs1234567a

2nd new variable = rs1234567b

That's not difficult either, because we can just define the new names in the DATA step. However, she said, "I have 3,000 variables."

That is a very, very big dataset. If she defined the new variables by herself in SAS, she would have to write out 6,000 new variable names. She thought a SAS macro could save her time, but that is still not efficient. Finally, I found a good way to finish her job.

Step 1: Export her dataset to Excel; the first row of the Excel file lists all the variable names.

Step 2: Cut out those 3,000 variable names and create a new dataset with 3,000 temporary variables holding them.

Step 3: Put all 3,000 temporary variables into an array named "old", and create 6,000 new variables in two other arrays named "newa" and "newb".

Step 4: Use a DO...END loop to concatenate the 3,000 old names with "a" and "b", and put the results into arrays newa and newb, respectively.

Step 5: Export this dataset to Excel, and all 6,000 new variable names are created.

The following program is a simplified example with only 5 old variables:

data test;
input (old1-old5) ($);
array old[5] $ old1-old5;
array newa[5] $12 newa1-newa5; /* make the new variables long enough */
array newb[5] $12 newb1-newb5;
do i=1 to 5;
   newa[i]=trim(old[i])||"a"; /* TRIM removes trailing blanks so there is no gap before "a" */
   newb[i]=trim(old[i])||"b";
end;
keep newa1-newa5 newb1-newb5;
cards;
rs12 rs34 rs56 rs78 rs90
;
run;

proc print; run;

proc export data=test outfile="C:\Temp\newtest.xls"; run;
quit;

- Feb 07 Wed 2007 08:46
## How to concatenate two variables with array

- Feb 07 Wed 2007 08:37
## New category is established!

This is a new category on my blog for practical issues in using SAS. Even though I have used SAS for many years, I still can't say I understand it completely. The depth of SAS programming goes far beyond what we can imagine. SAS is undeniably powerful, but its complexity also troubles many users. Many Taiwanese students at UNC-CH often ask me SAS questions. I am glad to help them, because they also give me the chance to face challenges I hadn't encountered before. I sincerely share my solutions on this blog. It is a record of sorts, but I would rather regard it as the cumulative experience of using SAS over my life.

- Mar 16 Thu 2006 06:14
## Consulting Case Study -- No.20060316

I have been struggling with Marianne's case for many weeks. I taught her how to use the likelihood ratio test (LRT) to determine whether her reduced model is adequate, but the result always rejects the null hypothesis. That doesn't make sense, because all the remaining independent variables are significant!

I re-ran her program and got a good result that was totally different from Marianne's: it did not reject the null hypothesis. I asked her to re-run her SAS programs, but she still got results different from mine.

After comparing our programs, I finally figured out what happened: I keep all discrete independent variables in the CLASS statement even when I don't use them in the MODEL statement, but Marianne deleted them from both the CLASS and MODEL statements, like this:

/* Marianne's program */
proc genmod data=mb1.total;
class urban hospcode;
model psatisfa=urban workcomp asupserv / dist=nor link=identity type3;
repeated subject=hospcode/type=exch;
run;

/* My program */
proc genmod data=mb1.total;
class urban hospcode bedsize netwrk totalmagnet; /* here is the difference!! */
model psatisfa=urban workcomp asupserv / dist=nor link=identity type3;
repeated subject=hospcode/type=exch;
run;

I discussed this with my supervisor, Mark, and he said both programs are correct. Why, then, are the results different? Because if we put variables in the CLASS statement, SAS deletes all observations with missing values in those variables, whether or not the variables appear in the model! Conversely, if we don't keep those non-significant variables in the CLASS statement, SAS uses all the data to fit the model.

Therefore, for the likelihood ratio test, we need to delete all observations with missing data and use the remaining complete cases to compute the likelihoods. After confirming that the reduced model is better than the full model, we then use the whole dataset to fit the final model, just as Marianne's program does!
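Once both models are fitted on the same complete cases, the test statistic itself is easy to compute by hand. A minimal sketch (the -2 Log Likelihood values and the degrees of freedom are made-up numbers for illustration):

```sas
/* LRT by hand: difference of -2LogL values from the full and
   reduced fits on the SAME complete-case data */
data lrt;
   m2logl_full    = 1523.4; /* made-up value */
   m2logl_reduced = 1528.1; /* made-up value */
   df    = 2;               /* parameters dropped in the reduced model */
   chisq = m2logl_reduced - m2logl_full;
   p     = 1 - probchi(chisq, df);
run;
proc print data=lrt; run;
```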

- Feb 27 Mon 2006 05:03
## Consulting Case Study -- No.20060224

I met with Marianne, a postdoc in the School of Nursing at UNC-CH, to discuss some remaining problems in her publication. Those problems confused me as well, but my supervisor Mark didn't have time to join our meeting last Friday. Hence, I stopped by Mark's office again today (Monday). We reached some useful conclusions.

The first question is whether the likelihood ratio test (LRT) is suitable for comparing two models that have the same independent variables, but where some of those variables have different attributes. For example, one variable is continuous in one model; what if we use a categorical version of it in the same model? That's the key point of this question. The only confusing thing is that the LRT applies only to nested models, and I was not sure whether this situation qualifies. Mark told me it does, because we can regard the continuous version as the reduced model and the categorical version as the full model, since the categorical version involves more parameters. Under this view, we can use the LRT as usual.

The second question is more complicated. Marianne had already finished the model selection and just needed an LRT to confirm that her final models were the best. With the LRT, the decision should be non-significant with a large p-value; then we fail to reject the null hypothesis, which corresponds to the reduced (final) model. However, the result was significant, a complete contradiction. Checking the original SAS code revealed no problem either. However, Mark said that, based on Marianne's study design, she needed to keep two important variables in the model, whether significant or not. After including those two variables, the conflict was eliminated. But I was still wondering whether one of them overlapped heavily with the other, because both are geographic variables and are highly similar. I dropped the less important one and fit the model again, and the result looked better. From this problem, we learn that we need to know more about the variables before model fitting; that will reduce this kind of confusion.

The two solutions have already been emailed to Marianne. I hope she finds them useful.

- Feb 08 Wed 2006 05:35
## Consulting Case Study -- No.20060207

June Cho is a Korean postdoc in the School of Nursing at UNC-CH. I handled her dissertation from December 2004 to May 2005, and she graduated smoothly in July 2005. Her husband is a professor in the School of Pharmacy; I guess they have become U.S. citizens. After graduating, she stayed here as her advisor's postdoc, continuing the research from her dissertation.

She wants to run a two-way ANOVA to compare simple main effects in her current study. It's very easy, but she just needed my confirmation. I built a macro for her so she can call it to fit all of her models (18 in total). However, simple main effects should only be examined when the interaction term is significant. I ran only one model, and its interaction term was significant, but I can predict that not all of the models will have significant interactions. Yet simple main effects are the whole purpose of her current research. What should we do when the interaction is non-significant?

As usual, I asked my supervisor, Mark. He said that even when the interaction term is not significant, we can still keep it in the GLM model. That way, all 18 models include the interaction term, so the simple-main-effect results are consistent across models. This makes for a more defensible presentation in the discussion section.
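One way to report simple main effects in SAS while keeping the interaction in the model is the SLICE option of the LSMEANS statement; a minimal sketch (the dataset study and the variables a, b, y are hypothetical):

```sas
proc glm data=study;
   class a b;
   model y = a b a*b;     /* keep the interaction even if non-significant */
   lsmeans a*b / slice=a; /* simple main effects of b within each level of a */
run; quit;
```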

-----

- Feb 06 Mon 2006 02:58
## The test of multivariate normal distribution

In some statistical analyses, we'd like to test the normality assumption before the analysis begins. In the univariate case, we all know that Q-Q plots and K-S statistics can be used to assess normality. But what about the multivariate normal distribution?

Mardia's statistic is a test for multivariate normality. Based on functions of skewness and kurtosis, Mardia's PK should be less than 3 for the multivariate normality assumption to be considered met. However, in both SAS and SPSS, there is no statement in any procedure that performs it easily.

In SAS, we need a macro to calculate Mardia's PK statistics. SAS Inc. released the code on its official website; please check the following link:

http://support.sas.com/ctx/samples/index.jsp?sid=480
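If the sample at that link is the %MULTNORM macro, a call might look like the sketch below; the macro name, parameters, and file path are all assumptions from memory, so verify them against the downloaded file:

```sas
/* Assumed usage: the macro name and parameters (data=, var=) are
   guesses -- check the file downloaded from the SAS sample link */
%inc "C:\Temp\multnorm.sas"; /* hypothetical local path */
%multnorm(data=mydata, var=x1 x2 x3);
```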

Likewise, in SPSS, we need a macro to examine bivariate/multivariate normality. Check it here:

http://www.columbia.edu/~ld208/

-----

- Feb 03 Fri 2006 08:44
## Consulting Case Study -- No.20060203

Lindsey Austin is a master's student (I guess) who works for a professor (I am not sure whether she is a TA). Her professor asked her to analyze some records to examine students' study ability. However, she is not good at statistics, so she sent the dataset to me.

The question is very simple: how to calculate the correlation between individual scores and GPA in reading, math, science, and fundamentals for some courses. The individual score variable is a scale (0-100), but GPA is ordinal (A+, A, A-, ..., F).

In correlation analysis, there are three coefficients we often use: Pearson, Kendall's tau, and Spearman. However, none of them is designed for the "scale vs. ordinal" case.

I wondered whether there is some special correlation coefficient I don't know about. I checked the SAS manual under PROC CORR, but there is no special correlation there. My supervisor, Mark, even dug out his old handouts (he also graduated from the biostatistics department at UNC-CH) to search for any lead, but found nothing either.

Finally, we concluded that we can rank the individual score variable and use the Spearman correlation.
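A minimal sketch of that conclusion (the dataset and the letter-to-number coding are hypothetical; with the SPEARMAN option, PROC CORR ranks both variables itself):

```sas
/* Hypothetical data: 0-100 score plus a letter GPA recoded as ordinal */
data grades;
   input score gpa $;
   select (gpa); /* hypothetical coding: larger number = better grade */
      when ('A+') gpanum=13; when ('A') gpanum=12; when ('A-') gpanum=11;
      when ('B+') gpanum=10; when ('B') gpanum=9;  when ('B-') gpanum=8;
      otherwise gpanum=1; /* collapse the rest for this sketch */
   end;
cards;
92 A
85 A-
78 B+
61 B-
;
run;

proc corr data=grades spearman;
   var score gpanum;
run;
```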

This is a pretty special case. I think there should be a specific correlation coefficient for this situation, but we haven't figured it out yet. If we do, I will post it here.

- Jan 25 Wed 2006 09:27
## Preface of new category

I have created a new category for articles about my statistical consulting experiences.

It's not only a memoir, but also a record I can look back on if I encounter similar cases or problems in the future.

I'll write these articles in my office and type them in English. That doesn't mean I am showing off my English ability; there is simply no Chinese input software on my office computer.

Don't be picky about my grammar and spelling. I recognize that my English is poor. Well, if anyone has suggestions, feel free to leave a message on the relevant articles.

- Jan 25 Wed 2006 09:10
## Consulting Case Study -- No.20060126

Marianne is a postdoc in the School of Nursing at UNC-CH; she now lives in Virginia. Her study compares urban and rural hospitals' nurse work environments and quality of care. I have been consulting on this case since last December.

The model she fits is a hierarchical regression model with generalized estimating equations (GEE). GEE is very popular for model fitting when observations are correlated and the dataset contains many missing values as well. It was created by Kung-Yee Liang, a Taiwanese professor in the Department of Biostatistics at Johns Hopkins University.

Today, Marianne sent an email showing an error message in her SAS output: the Hessian matrix is not positive definite, and the program terminated. I had encountered this situation before, but until now I didn't know how to solve it. My supervisor, Mark, told me it appears when some categorical independent variable has empty cells against the categorical dependent variable. We can use PROC FREQ in SAS to check which IV has empty cells, then delete it.

Just as Mark said, there was an empty cell in one IV. After deleting that variable, the output appeared without any error message.
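A minimal sketch of that PROC FREQ check (the dataset and variable names are hypothetical):

```sas
/* Cross-tabulate each categorical IV against the categorical DV;
   any zero-count cell flags the variable to drop */
proc freq data=mydata;
   tables (iv1 iv2 iv3)*dv / norow nocol nopercent;
run;
```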

It's a great experience in model fitting. Thanks, Mark!

- Mar 04 Fri 2005 12:14
## Tuition, All of It, Waived

Since the School of Nursing took me in last year, I never imagined I could get any tuition remission. My boss Michael couldn't promise me anything at the start, so I only hoped to receive my salary on time each month (which, to be fair, was never in doubt; the university hardly ever delays pay).

At the end of last semester, the department secretary posted a notice asking everyone to submit their pay stubs for departmental review. I asked Michael for a copy and, without waiting for word from the School of Nursing, sent it to my department first to see if there was any chance, because the dean had said that if the School of Nursing wouldn't pay, the department would help cover part of it. But when I handed my pay stub to Malissa, she said the School of Nursing was willing to pay half of my tuition.

Hearing that news just before Christmas was probably the best Christmas gift! I then happily deferred the tuition bill and waited for a new bill to pay. I waited and waited, more than three months, without a trace of the bill. Last week LinLin came over and told me to hurry and check my bill online, because more than six thousand dollars (about two-thirds) had already been deducted from her tuition. That differed from what Malissa had said, so I nervously checked online, and sure enough two-thirds had been deducted! But it wasn't over: LinLin said another deduction would follow, and what remained would be only the student affairs fee, about six hundred dollars. I am easily satisfied; at the time I thought two-thirds off was already great and held no hope of further remission. Then a few days ago I checked online again, and sure enough another amount had been deducted, leaving only a bit over six hundred dollars to pay. This semester's tuition was waived just like that, inexplicably. Michael is truly my savior; if he hadn't taken me in, I would still be painfully paying that nine thousand US dollars in tuition!

To commemorate this touching moment, I am pasting the record from the website here.

=====================================================================================

This reflects new charges and payments that will appear on your next bill.

ACCOUNT STATEMENT

DATE      SUBCODE   DESCRIPTION              DEBITS   CREDITS   BALANCE
                    PREVIOUS BALANCE                            9001.51
02/14/05  69220     TUIT REMISSION PH GRAD            6245.50   2756.01
02/28/05  90601     PAYMENT - ISA                     2139.50    616.51

=====================================================================================