This is a quick guide to performing common statistical tests and calculations in R.
Basic stats
A series, e.g. series <- c(0,1,2,3)
SD of a series 
sd(series)
Mean 
mean(series)
Converting an odds ratio to a probability 
OR / (1 + OR)
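As a quick check in R (the OR of 3 is an arbitrary example):
OR <- 3
OR / (1 + OR)    # 0.75: odds of 3:1 correspond to a probability of 75%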
Probability
The choosing function: "What are the chances of getting a certain 2 cards from 2 draws of a 52 card deck?" choose(52,2) gives the number of possible pairs, so the probability of one specific pair is 1/choose(52,2)
Binomials: coin flip experiments
Variance of a binomial: n*p*(1-p), i.e. number of trials * probability * (1 - probability)
SD of a binomial
sqrt(n*p*(1-p))
"What are the chances of getting exactly 1 head from 5 coin tosses?"
dbinom(1,5,0.5)
"What are the chances in a couple having at least 1 boy if they have 5 children?"
1pbinom(0,5,0.5)
Or
pbinom(5,5,0.5)-pbinom(0,5,0.5)
"What is the standard deviation for 500 coin flips?"
sqrt(500*0.5*(1-0.5))
Normal distributions/Z scores
Proportion of values with a Z score of -1 (i.e. 1 SD below the mean): pnorm(-1)
Approximating a binomial with a normal, e.g. N=100 and p=0.2; calculate the probability of getting 10 successes or fewer:
1. Calculate the mean: n * p = 100 * 0.2 = 20
2. Calculate the SD: sqrt(100 * 0.2 * (1 - 0.2)) = 4
3. Calculate the Z score: (result - mean) / SD = (10 - 20) / 4 = -2.5
4. Look up the probability: pnorm(-2.5)
...Or you could just do
pbinom(10,100,0.2)
and not approximate the binomial with a normal at all. E.g. what is the p value for getting 65 (or more) heads from 100 coin flips?
(1-pbinom(64,100,0.5))
And a two-sided test:
(1-pbinom(64,100,0.5))*2
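Base R's binom.test() gives the same one-sided answer directly, sketched here:
binom.test(65, 100, 0.5, alternative="greater")    # P(X >= 65) under a fair coin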
Statistical tests: continuous data

Parametric* tests, independent data:
T-test** (for 2 groups)
Two-sided: t.test(series1, series2)
One-sided: t.test(series1, series2, alternative="<greater or less>")
One-sample: t.test(series1, mu=<expected mean>)
Unequal variances: t.test(series1, series2, var.equal=FALSE)
Or, if given the difference, SD and n, calculate the T score: Tscore = Diff / sqrt((pooledSD^2 / n1) + (pooledSD^2 / n2)), i.e. Tscore = Diff / (standard error), then calculate the p value: pt(-abs(Tscore), df=<df>), as sketched below.
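A minimal sketch of the manual route (all numbers are made up for illustration):
meandiff <- 1.5                  # observed difference in means
pooledSD <- 2.0                  # pooled standard deviation
n1 <- 20; n2 <- 20               # group sizes
se <- sqrt(pooledSD^2/n1 + pooledSD^2/n2)
Tscore <- meandiff / se
pt(-abs(Tscore), df = n1 + n2 - 2)    # one-sided p value; multiply by 2 for two-sided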
ANOVA** (2+ groups; "Is the variance between groups greater than the variance within them?")
Combine the series with y = c(ser1, ser2, ser3) and fit with aov(), as in the sketch below. For 95% CIs of the residual variance:
anova(fit)["Residuals", "Sum Sq"] / qchisq(c(0.025, 0.975), <df>, lower.tail = FALSE)
(where <df> is the Residuals degrees of freedom from the ANOVA table)
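A minimal end-to-end sketch, assuming ser1, ser2 and ser3 are numeric vectors:
y <- c(ser1, ser2, ser3)
group <- factor(rep(c("g1", "g2", "g3"), times = c(length(ser1), length(ser2), length(ser3))))
fit <- aov(y ~ group)
summary(fit)                      # F test: between-group vs within-group variance
anova(fit)["Residuals", "Df"]     # the <df> used in the CI formula above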
Pearson's correlation coefficient (1 continuous linear predictor): cor.test(series1, series2, method="pearson") (note: use="complete.obs" belongs to cor(), e.g. cor(series1, series2, use="complete.obs"), not to cor.test())
Linear regression
Simple linear regression: create a data frame: dataframe = data.frame(outcome = series1, predictor1 = series2, predictor2 = series3, categoricalpredictor = series4), then fit the model as sketched below.
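A minimal sketch using the column names above:
fit <- lm(outcome ~ predictor1 + predictor2 + factor(categoricalpredictor), data = dataframe)
summary(fit)      # coefficients and p values
confint(fit)      # 95% CIs for the coefficients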

Parametric* tests, correlated data:
Paired T-test** (2 groups or time points). Two-sided: t.test(series1, series2, paired=TRUE)
Repeated-measures ANOVA** (2+ groups or time points), e.g. 2 groups are followed up over multiple time points, not just a simple before and after (paired T-test)
Mixed models/GEE modelling (MV regression)
Nonparametric tests:
Wilcoxon rank-sum, AKA Mann-Whitney U (alternative to the T-test; small samples): wilcox.test(series1, series2)
Wilcoxon signed-rank (alternative to the paired T-test; small samples): wilcox.test(series1, series2, paired=TRUE)
Kruskal-Wallis (alternative to ANOVA); see the sketch below
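A minimal sketch, assuming the same ser1, ser2 and ser3 vectors as in the ANOVA example:
kruskal.test(list(ser1, ser2, ser3))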
Spearman rank correlation coefficient (alternative to Pearson's); see the sketch below
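A sketch mirroring the Pearson call above:
cor.test(series1, series2, method="spearman")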
Statistical tests: categorical data

Independent data:
Risk difference/relative risks, for a 2x2 table (where n*p > 5; otherwise you can't assume normality; use a nonparametric alternative): mymatrix <- matrix(c(25,20,20,25), ncol=2, byrow=TRUE), then see the sketch below.
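A minimal sketch of the risk difference and relative risk from that matrix, assuming rows are exposed/unexposed and the first column is the event:
risk1 <- mymatrix[1,1] / sum(mymatrix[1,])    # risk in the exposed row
risk2 <- mymatrix[2,1] / sum(mymatrix[2,])    # risk in the unexposed row
risk1 - risk2                                 # risk difference
risk1 / risk2                                 # relative risk
chisq.test(mymatrix)                          # accompanying chi-square test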
Chi-square (RxC table): same as the 2x2 table above (build the matrix, then chisq.test(mymatrix))
Logistic regression: see the instructions for assembling the data frame under 'Linear regression'. Then: df.logit = glm(outcome ~ predictor1 + factor(predictor2), data = dataframe, family = "binomial"). N.B. Estimates for the predictors will be given as logits (natural log of the OR; ln(p/(1-p))). These can be converted to an OR using exp(df.logit$coefficients). If e.g. 0.96, this means for every 1-unit increase in your predictor variable, your odds decrease by 4%.
For confidence intervals of the coefficients: confint(df.logit). And for confidence intervals of the ORs: exp(cbind(OR = coef(df.logit), confint(df.logit))). If a categorical variable (factor; e.g. predictor2) is used, each level will appear as a separate term in summary(). An overall effect of the categorical predictor can be assessed using the wald.test() function of the aod library: wald.test(b = coef(df.logit), Sigma = vcov(df.logit), Terms = 3:5), where 3:5 refers to terms 3 to 5 (counting from 1; in this example the first level of predictor2 would be term 3). If you want to look for an interaction (i.e. is predictor1 affecting the outcome significantly differently in males vs females): df.logit = glm(outcome ~ predictor1 * factor(male), data = dataframe, family = "binomial"). (You can test all combinations of interactions between a, b and c with a * b * c, or only a subset of the interactions with a + b + c + a:b, which would not test for a:c or a:b:c.)

Correlated data:
McNemar's chi-square test, for a 2x2 table with correlated (paired) data; see the sketch below
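A minimal sketch with base R, assuming mymatrix holds the paired 2x2 counts:
mcnemar.test(mymatrix)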
Conditional logistic regression: like logistic regression, but for correlated data; see the sketch below
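A sketch using the survival package's clogit(); matchid is an assumed column identifying the matched sets:
library(survival)
clogit(outcome ~ predictor1 + strata(matchid), data = dataframe)    # matchid is hypothetical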

Alternatives:
Fisher's exact test: alternative to chi-square for sparse non-correlated data (expected count in any cell < 5). Perform as per the 2x2 table above, but test with: fisher.test(mymatrix)
McNemar's exact test: alternative to McNemar's test for sparse correlated data. Perform as per the 2x2 table above, but test with the exact2x2 library: library(exact2x2); mcnemar.exact(mymatrix)

Statistical tests: time-to-event data

Independent data:
Rate ratio (two groups); see the sketch below
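A minimal sketch using base R's poisson.test() to compare two rates (all numbers made up):
events <- c(30, 20)          # event counts in each group
persontime <- c(100, 100)    # person-time at risk in each group
poisson.test(events, persontime)    # rate ratio with CI and p value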
KaplanMeier & Log rank(Two+ groups)library(survival) Load data e.g.: hmohiv<read.table("http://www.ats.ucla.edu/stat/r/examples/asa/hmohiv.csv", sep=",", header = TRUE) Assuming right censored data: surv < survfit(Surv(time, censor)~ strata(drug), conf.type="none")
Calculate log rank:survdiff(Surv(time, censor) ~ drug)
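To view the curves (a quick sketch; one line type per drug group):
plot(surv, lty=1:2, xlab="Time", ylab="Survival probability")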
Cox regression (multivariate regression technique producing hazard ratios; two+ groups)
library(survival)
Load data and preview the top rows, e.g.: data(lung); head(lung)
Add a survival object (status == 2 is death): lung$SurvObj <- with(lung, Surv(time, status == 2))
Calculate KM estimates for the whole set and then by sex:
km.as.one <- survfit(SurvObj ~ 1, data = lung, conf.type = "log-log")
km.by.sex <- survfit(SurvObj ~ sex, data = lung, conf.type = "log-log")
Fit the Cox regression, adjusting for age, sex, Karnofsky performance score and weight loss, and print it:
res.cox1 <- coxph(SurvObj ~ age + sex + ph.karno + wt.loss, data = lung)
summary(res.cox1)
Check for violation of proportional hazards (a constant HR over time) and plot the scaled Schoenfeld residuals, along with a smooth curve, in a 2x2 plot:
res.zph1 <- cox.zph(res.cox1)    # look for significant p values
par(mfrow=c(2,2)); plot(res.zph1)    # observe graphs for obvious non-linearity; sometimes the above tests fail to detect it
If the above imply a violation, one can either stratify on a categorical variable, e.g. ph.karno in this example:
res.cox1.strata <- coxph(SurvObj ~ age + sex + strata(ph.karno) + wt.loss, data = lung)
or one can use a time-varying effects model (see 'Time-varying effects' below).
Correlated data:
Frailty model (MV regression)
Alternatives:
Time-varying effects (akin to Cox regression, but for violation of the proportional hazards assumption)
library(survival)
Set a column to 1 or 0 according to whether dead (status == 2) or not: lung$event <- ifelse(lung$status == 2, 1, 0)
Give each patient an id (used by cluster() below): lung$id <- 1:nrow(lung)
Create a survival data set with cuts at 200 and 400 (based around the appearance of the Schoenfeld residuals; see Cox regression), e.g.:
lung.split <- survSplit(data = lung, cut = c(200, 400), end = "time", start = "start", event = "event")
Create a survival object column: lung.split$SurvObj <- with(lung.split, Surv(time = start, time2 = time, event = event))
Preview your data: head(lung.split, 10)
Perform the analysis, testing for an interaction between the violating variable and the cut (start) times, and print the results:
res.cox1.strata <- coxph(SurvObj ~ age + sex + ph.karno + ph.karno:factor(start) + wt.loss + cluster(id), data = lung.split)
summary(res.cox1.strata)

* Normally distributed outcomes: important for small samples. Large samples are robust against this because of the central limit theorem (means are normally distributed, even if the underlying trait is not).
** Homogeneity of variances: though with the T-test you can unpool the variances. Can test this using Fisher's F-test: var.test(a, b),
where a and b are arrays of values. If p > 0.05 then assume the variances are homogeneous. Alternatively, compare the value of F for alpha = 0.05 with qf(0.95, <degrees of freedom numerator>, <degrees of freedom denominator>),
where e.g. both DOF = 9 for comparing 2 groups of 10 each.
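For example, for two groups of 10 (df = 9 and 9):
qf(0.95, 9, 9)    # roughly 3.18; an F statistic above this rejects equal variances at alpha = 0.05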