這一篇其實一個多月前就該發的,但時間一拖自己也熊熊忘了。剛剛才想到所以稍微寫一下發表。
目前分類:Stat Consulting (10)
- Jan 11 Sun 2009 08:19
The group comparison in log-transformed linear models
- Sep 10 Wed 2008 02:07
分類。分類。分
好像很久沒有發 statistical consulting 的文章,因為這幾年在護理系什麼樣大大小小的統計問題都遇過了,後面的挑戰性就越來越低,所以也沒遇到什麼麻煩。
- Sep 09 Tue 2008 14:29
分類。分類。分
好像很久沒有發 statistical consulting 的文章,因為這幾年在護理系什麼樣大大小小的統計問題都遇過了,後面的挑戰性就越來越低,所以也沒遇到什麼麻煩。
不過上星期一個 clinical staff 的人寫 email 給我請求統計協助。她說她找了兩個我們系上的學生問,可是他們都不會。我想說這是什麼樣的天大難題,居然讓兩個生統系博士班的學生束手無策,所以我就先答應了她,然後把她 email 給我的一個文件下載來看看。
結果,短短兩頁的說明文件,讓我看了一個多小時才看的懂。不是她寫的用字太難,而是整個敘述相當沒有條理。後來經過我額外整理出十點條目才把她的研究和問題釐清。
簡單來講,這是一個針對膝蓋手術後的療程試驗,有實驗組和對照組。實驗組用的是新療程,對照組當然是用傳統方法。但問題時,這六十幾個病人有些在進入這個臨床試驗時就已經用了額外的療法,有些病人則是在進入試驗後,自己額外去使用別的療法。總之這些病人的疼痛效應完全被跟實驗不相干的療程給干擾到了。
而基本上這種情況幾乎可以預想到統計分析的結果絕對會有偏差,而他們也不可能再用原始的組別(實驗組vs對照組)去作檢定或分析。但要他們去重新收資料已經是不可能了,畢竟金錢和人力都花下去了,現在就看有沒有辦法在 study design 上作任何改進。
當時我第一個想到的就是重新分群,但每個人有各自不同複雜的情況,即便是這位 staff 寄給我資料我還是看不太懂。今天她跑來親自解釋給我聽這些病人的狀況,我再依照她的敘述把上星期整理的十個條目減為八個條目,然後設定八個 binary 變數來代表這八個條目。最後把這八個變數丟去跑 cluster analysis 後畫出像這樣子的樹狀圖:
然後就叫她把 output 帶回家自己去歸類並重新替新組別命名,而這部分我就幫不上忙了,因為只有她最瞭解每個病人的特徵。我請她去朝找尋同一組人有沒有共同治療特徵,再用有意義的名稱去定義新分群。而這也是目前我想到唯一能讓她繼續把接下來的分析弄完的方法。希望老天爺夠照顧她能讓她真的找到好的新分群。
- Feb 28 Wed 2007 01:22
QIC and GEE model diagnostic
Ann Jessup,一個護理系即將畢業的博士班學生,其 committee member 之一正是我老闆 Dr. Bangdiwala。由於這一陣子我老闆跑去智利不知道去幹啥,因此 Ann 的博士論文中的統計部分就落到我的頭上了。
她的論文題目裡面用到所謂的 GEE model,這對我來說並不是太困難的事情。但之後卻發生了連我 RA 老闆 Mark 都不知道該怎樣解決的麻煩。第一個麻煩是,Ann 要從下面這三個 GEE model 中挑選最好的一個
:
Y=X1+X2+X3+W
Y=X1+X2+X3+V
Y=X1+X2+X3+Z
這三個模式除了最後一個獨立變數不一樣之外,其餘都一樣。簡單的作法就是分別做三個模式,如果 W,V,Z 其中一個顯著而另兩個不顯著,自然就可以認定含有那一個顯著變數的模式是比較好的。
但是,如果有兩個以上顯著呢?
當然,我沒有一開始就直接去跑這三個模式,所以不知道結果會是如何,但總是要假設一下各種可能性。如果是在 mixed model,這種 non-nested 的情況就直接用 AIC 或 BIC 這兩個數據來判定,越小的越好。那 GEE model 有沒有類似的準則可以判定呢?答案是有的!GEE model 也可以用 AIC,但不幸的是,目前 SAS 9.1.3 版的 PROC GENMOD 沒有內建這個功能。(聽說9.2版會內建了,但不知啥時才會 release 出來)。於是我就上網去看看有沒有其他人發明相關的方法,果真發現一個叫做 QIC 的數據,而且 SAS 公司已經把他的 macro 程式放在網上任意下載。所以我就拿 SAS 提供的程式搞定了這一部份。
第二個麻煩是,當選好了 GEE model,要如何做模式檢測來挑出可能的離群值或影響值呢?其他的統計模式都有些方法可以來做模式檢測,就唯獨 GEE model 沒看過什麼模式檢測。我翻了以前上課的講義,老師也沒講這一段。跑去問 Mark,他居然也不知道。此時,又得去拜一下 Google 大神了。沒想到不只有人把這段理論給弄了出來,連程式也寫好了。不過更幸運的是,發表這些 paper 的學者中,有一個共同名字,叫做 John Pressier。而這位仁兄,正巧就是我的 committee member 之一。
於是乎當然就很高興的把 paper 稍微讀一下,接著就馬上下載他寫好的 macro 來用。中間遇到一個問題,那就是他的程式裡面並沒有辦法設定某些變數是離散的。後來直接寫信去問他,他叫我把那些變數改成 dummy variable 就搞定了。
所以,就在這麼許多巧合和運氣之下,Ann 的兩大問題就這樣被我解決了。同時間自己也學到了很多,其實我應該要更感謝 Ann 才對,否則我永遠也不會知道 QIC 和 GEE model diagnostic 的存在。來護理系工作真是惠我良多啊。
- Mar 16 Thu 2006 06:14
Consulting Case Study -- No.20060316
I am struggling Marianne's case many weeks. I taught her how to use LRT to determine whether her reduced model is good, but the result always rejects null hypothesis. It doesn't make sense because all remaining independent variables are significant!!
I re-ran her program again, but have a good result that is totally different from Marianne's and doesn't reject null hypothesis. I requested her re-ran her SAS programs again, but she still had different results from mine.
After comparing our programs, I finally figured out what happened. That is, I keep all discrete independent variables in CLASS statement even though I don't use them in MODEL statement, but Marianne deleted them in both CLASS and MODEL statement, such as:
/*Marianne's program*/
proc genmod data=mb1.total;
class urban hospcode;
model psatisfa=urban workcomp asupserv / dist=nor link=identity type3;
repeated subject=hospcode/type=exch;
run;
/*My program*/
proc genmod data=mb1.total;
class urban hospcode bedsize netwrk totalmagnet; <==here the difference is!!
model psatisfa=urban workcomp asupserv / dist=nor link=identity type3;
repeated subject=hospcode/type=exch;
run;
I discuss with my supervisor, Mark, and he said both of the two programs are correct. Why are the results different? Because if we put variables in CLASS statement, SAS will delete all missing data in those variables, whatever the variables are in the model or not! On the contrary, if we don't keep those non-significant variables in CLASS statment, SAS will hold all data to fit the model.
Therefore, in likelihood ratio test, we need to delete all missing data and use the remaining completed data to yield likelihood function. After determining this reduced model is better than full model, then use whole dataset to fit final model, just as what Marianne's program shows!
- Feb 27 Mon 2006 05:03
Consulting Case Study -- No.20060224
I met with Marianne, a post doc in the School of Nursing in UNC-CH,
to discuss some remaining problems in her publication. Those
problems caused me confused as well, but my supervisor Mark didn't
have time to participate in our meeting last Friday. Hence, I stop
by Mark's office again today (Monday). There are some useful
conclusion after meeting with Mark.
The first question is whether Likelihood Ratio Test(LRT) is
suitable in comparing two models have the same independent
variables but some of them have differnt attributes. For example,
one variable is continuous in a Model, but how about using a
categorical one in the same model? That's the key point of this
question. The only confused thing is that LRT is only used in
nested model, but I am not sure whether this kind of situation is
the same. Mark told me it is because we can regard the continuse
one is a reduced model and the categorical one is full model in
that there are more variables in categorical one. Based on this
assumption, we can use LRT as usual.
The second question is more complicated. Marianne had already
finish the part of model selection, but just need a LRT to confirm
her final models are the best one. By using LRT, the decision
should be non-significant with large p-value, then we can have no
rejection of the null hypothesis which is reduced (final) model.
However, it is totally conflicted because the result is
significant. After checking the original SAS code, there is no
problem as well. However, Mark said, based on Marianne study
design, she needs to keep two important variables in this model
whatever it is significant or non-significant. After including the
two variables in this model, the conflict was eliminated. But, I
was still wondering whether one of them is highly overlapped with
another one because they are all geographic variables and have
highly similarity. I dropped out a less important one and fit the
model again, the result looked better. From this problem, we can
understand that we need to know more about variables before model
fitting, then we will decrease confusion from that.
The two solutions had already been emailed to Marianne. Hope she
will feel useful.
- Feb 08 Wed 2006 05:35
Consulting Case Study -- No.20060207
June Cho, a Korean woman who is a postdoc in the School of Nursing in UNC-CH. I handled with her dissertation from 2004.DEC to 2005.MAY, and she graduated smoothly on 2005.JUL. Her husband is a professor in the School of Pharmacy. I guess they have been the U.S. citizens. After she graduated, she stay here to be her advisor's postdoc, and keep doing advanced research from her dissertation.
She wants to do a 2-way ANOVA to compare simple main effect in her current study. It's very easy, but she just needs my confirmation. I constructed a macro to her and she can just call this macro to fit all of her models (18 models). However, simple main effect is only used under the interaction term is significant. I only ran a model and the interaction term is significant, but I can predict not all of them have significant interaction terms. However, simple main effect is her only purpose of current research. How could we do it under non-significant interaction?
Regularly, I asked my supervisor, Mark. He said even though the interaction term is not significant, but we can still keep it in GLM model. Therefore, we con consist all results of simple main effect from those 18 models because all of them include interaction term. This could be a more suitable conclusion in discussion section.
-----
- Feb 03 Fri 2006 08:44
Consulting Case Study -- No.20060203
Lindsey Austin, a master student (I guess) who works for a professor to be something (I am not sure whether she is a TA). Her professor requests her to analyze some records to see student's study ability. However, she is not good at statsitics, so she sent the data set to me.
The question is very easy: how to calculate the correlation between individual scores and GPA in reading, math, science, and fundamentals in some courses. The individual score variable is scale (0~100), but the GPA is ordinal (A+, A, A-,...., F).
In correlation analysis, there are three correlation coefficients we often use: Pearson, Kendall's tau, and Spearman. However, none of them are for the case of "scale vs ordinal".
I am wondering whether there are some special correlation coefficients that I don't know. I went to check SAS menu to see "PROC CORR", but there is no special correlation. My supervisor, Mark, even took his old handouts (because he also graduates from biostatics department in UNC-CH) to search for any evidence, but there is no way as well.
Finally, we conclude that, we can rank the individual score variable, and use Spearman correlation.
This is a pretty special case. I think there should be a specific correlation for this situation, but we haven't figure it out. If so, I will show here.
- Jan 25 Wed 2006 09:27
Preface of new category
I construct a new category to put some articles of my statistical
consulting experiences.
It's not only a memory, but also a record that I can retrospect
some special cases or problems if I encounter some similar
situations in the future.
I'll write these articles in my office and type in English. It
doesn't mean I am show off my English ability because there is no
Chinese input software in the computer in my office.
Don't feel picky in my grammar and spelling. I recognize my English
is poor. Well, if there is any suggestion from someone, welcome to
leave your message in relative articles.
- Jan 25 Wed 2006 09:10
Consulting Case Study -- No.20060126
Marianne, a postdoc in the School of Nursing in UNC-CH. She lives in Virginia now. Her study is a comparison of urban and rural hospitals' nurse work environment and quality of care. I consult this case from last December.
The model she fits is a hierarchical regression model with generalized estimating equation (GEE). GEE method is very popular in model fitting when independent variables are highly correlated and there are many missing data in data set as well. It was created by Kung-Yi Liang, a Taiwanese who is a professor in the department of biostatistics in Johns Hopkins University.
Today, Marianne sent an email and show an error message in her SAS output. The problem is that the Hessian matrix is semi positive definite, and the program is terminated. I once encountered this situation before, but I don't know how to solve it so far. My supervisor, Mark, told me it is appeared when there are some empty cells in some categorical independent variables vs. categorical dependent variable. We can use PROC FREQ in SAS to check what IV has empty cells, then delete it.
As what Mark said, there is an empty cell in an IV. After deleting it, the output is shown without any error message.
It's a great experiment of fitting model. Thanks Mark!