Assignment 4, R002: Drawing Box Plots in R

真·科研狗 2017-08-09 02:33:08

First, get to know the function that draws box plots: boxplot(). Note that it is not the same as barplot(). The usage of boxplot() is as follows:

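# S3 method for class 'formula' (copied from the help page; S3 is R's basic class system, so this is the boxplot() method used when the first argument is a formula)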
boxplot(
    formula, 
    data = NULL, 
    ..., 
    subset, 
    na.action = NULL,
    drop = FALSE, 
    sep = ".", 
    lex.order = FALSE
)
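To make the "S3 method" wording in the help page more concrete, here is a minimal sketch of how to list the boxplot() methods and fetch the formula one. The variable names (formula_method, f, res) are ours, and plot = FALSE is used only so that nothing is drawn.

```r
# List the boxplot() methods registered in this R session.
print(methods(boxplot))

# Fetch the formula method explicitly (it is registered, not exported by name).
formula_method <- getS3method("boxplot", "formula")

# An object of class "formula" as first argument is dispatched to that method.
f <- count ~ spray
print(class(f))
res <- boxplot(f, data = InsectSprays, plot = FALSE)
print(res$names)
```

The same generic therefore accepts either a formula or the data itself, which is why the help page shows more than one usage.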

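formula: usually written as x ~ group. x is the numeric variable, plotted on the Y axis of the box plot; group is the grouping variable, which defines the groups along the X axis;

data: a data frame that contains the variables named in the formula, for example one column named x and another named group;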

Let's start with a simple example:

png(file="E:/boxplot.png")
boxplot(
  count ~ spray,
  data=InsectSprays,
  col = "lightgray",
  xlab="X axis label",
  ylab="Y axis label"
)
dev.off()

Running it produces the following image:

Let's see what the data behind data=InsectSprays actually looks like. Type the following directly in the console:

InsectSprays

This prints the data set; after looking at it, it is clear that the formula is simply count ~ spray.
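For a quicker look than printing every row, the structure of InsectSprays can be inspected as sketched below. This only uses the built-in data set; nothing here is specific to the assignment.

```r
# Column names and types: a numeric count and a factor spray.
str(InsectSprays)

# First few rows.
print(head(InsectSprays, 3))

# The groups that appear on the X axis, and how many rows each one has.
print(levels(InsectSprays$spray))
print(table(InsectSprays$spray))
```

The left side of count ~ spray is the numeric column and the right side is the factor, which matches the x ~ group pattern described earlier.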

Next, the remaining arguments:

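subset: the help() page describes it as "an optional vector specifying a subset of observations to be used for plotting." So it filters the observations, but the boxplot help page gives no further detail, so we look up the related subset() function with help():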

## Default S3 method:
subset(x, subset, ...)

## S3 method for class 'matrix'
subset(x, subset, select, drop = FALSE, ...)

## S3 method for class 'data.frame'
subset(x, subset, select, drop = FALSE, ...)
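As a minimal sketch of the subset argument in the formula method, the following keeps only two sprays. We assume an R version whose boxplot() formula method has the drop argument listed above; the names b_all and b_ab are ours, and plot = FALSE avoids drawing.

```r
# All six groups.
b_all <- boxplot(count ~ spray, data = InsectSprays, plot = FALSE)

# Only sprays A and B; drop = TRUE discards the now-empty groups.
b_ab <- boxplot(count ~ spray, data = InsectSprays,
                subset = spray %in% c("A", "B"),
                drop = TRUE, plot = FALSE)

print(b_all$names)
print(b_ab$names)
```

The subset expression is evaluated inside the data frame, so spray can be written without the InsectSprays$ prefix.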

na.action: a function which indicates what should happen when the data contain NAs. The default is to ignore missing values in either the response or the group. In other words, it decides what to do when the data contain missing values; by default they are ignored.
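A small sketch of that default behaviour, using a made-up data frame (the columns y and g and their values are purely illustrative):

```r
# Toy data: one NA in the response and one NA in the group.
toy <- data.frame(
  y = c(1, 2, NA, 4, 5, 6),
  g = factor(c("a", "a", "a", "b", "b", NA))
)

# plot = FALSE returns the box plot statistics without drawing.
b <- boxplot(y ~ g, data = toy, plot = FALSE)

# n counts the observations actually used in each group.
print(b$names)
print(b$n)
```

Both incomplete rows are left out, so each group ends up with fewer observations than rows in the table.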

boxplot() also has a second form, the default method, which takes the data directly instead of a formula:

boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,
        notch = FALSE, outline = TRUE, names, plot = TRUE,
        border = par("fg"), col = NULL, log = "",
        pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
        horizontal = FALSE, add = FALSE, at = NULL)

These arguments adjust the various style aspects of the plot.

For example:

png(file="E:/boxplot.png")
rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque")
title("Comparing boxplot()s and non-robust mean +/- SD")
mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)
sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)
xi <- 0.3 + seq(rb$n)
points(xi, mn.t, col = "orange", pch = 18)
arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,code = 3, col = "pink", angle = 75, length = .1)
dev.off()

The code above compares the box plots with the mean +/- standard deviation of each group. Running it gives the following image:
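The xi positions in that example rely on the value that boxplot() returns. A minimal sketch of inspecting it, with plot = FALSE so that no device is needed:

```r
rb <- boxplot(decrease ~ treatment, data = OrchardSprays, plot = FALSE)

# Components of the returned list.
print(names(rb))

# One column of five statistics per group:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
print(dim(rb$stats))

# Number of observations per group; seq(rb$n) therefore gives 1, 2, ..., one per box.
print(rb$n)
print(seq(rb$n))
```

Adding 0.3 to these positions is what shifts the mean and SD markers to the right of each box.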

1. Back to this assignment:

png(file="E:/boxplot.png")
boxplot(pressure)
dev.off()

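This gives the following image:

Settings such as the colour, the size, and the X and Y axis labels are made in the same way as for the bar chart from the earlier assignment.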

2. Set the default working directory

getwd()
setwd("E:/study/")

Note that the path must already exist; if it does not, setwd() fails with an error.
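One way to avoid that error is to check for the directory first. This is a sketch only; the target path below is a hypothetical directory under tempdir(), not the E:/study/ folder used above.

```r
# Hypothetical target directory for illustration.
target <- file.path(tempdir(), "study")

# Create it if it is missing, then switch to it.
if (!dir.exists(target)) {
  dir.create(target, recursive = TRUE)
}
old_wd <- setwd(target)   # setwd() returns the previous working directory

in_target <- normalizePath(getwd()) == normalizePath(target)
print(in_target)

# Restore the original working directory.
setwd(old_wd)
```

Saving the return value of setwd() makes it easy to go back afterwards.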

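png("E:/R4-2.png")
# header=TRUE: treat the first row of the CSV as column names
data<-read.csv('E:/R4_data.csv',header=TRUE)
# print the first three columns (data[,1,3] does not select columns 1 to 3)
print(data[,1:3])
boxplot(
  data[,1:3],
  notch=TRUE,
  col=c('red','green','blue')
)
# close the device so that the file is written
dev.off()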

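In the code above, header can be set to TRUE or FALSE when reading the CSV, that is, whether the first row is treated as the column names. The help page explains notch as follows: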

if notch is TRUE, a notch is drawn in each side of the boxes. If the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ.
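The notch limits are also returned by boxplot() in the conf component. The sketch below recomputes them for the first group with the formula used by boxplot.stats(), median +/- 1.58 * IQR / sqrt(n), where the IQR is taken between the hinges. The built-in InsectSprays data stands in for the assignment's CSV file.

```r
b <- boxplot(count ~ spray, data = InsectSprays, plot = FALSE)

# Notch limits as returned: row 1 is the lower limit, row 2 the upper limit.
print(b$conf)

# Recompute the limits for the first group by hand.
med     <- b$stats[3, 1]                  # median
hinges  <- b$stats[c(2, 4), 1]            # lower and upper hinge
by_hand <- med + c(-1.58, 1.58) * diff(hinges) / sqrt(b$n[1])
print(by_hand)
```

Two groups whose notch intervals do not overlap are the ones the help text describes as having different medians.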

To output a PDF instead, everything is basically the same as for PNG: replace png("E:/R4-2.png") with pdf("E:/R4-2.pdf").

3. Next, make the box plot horizontal. First look at the following screenshot:

As introduced earlier, the data read from an xls or csv file is a data frame, and the "$" operator extracts a column of the data frame by name.

The complete code is:

png("E:/R4-3.png")
data<-read.csv('E:/R4_data.csv',header=TRUE)
boxplot(
  data$Height~data$T,
  col=c('red','green','blue'),
  horizontal=TRUE
  )
dev.off()

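This uses the same formula form introduced at the start of this article; the second argument (data) is not supplied here because the formula refers to the columns directly as data$Height and data$T.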
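Both ways of writing the formula give the same boxes. A minimal check, using the built-in InsectSprays data in place of the assignment's CSV file (the names with_dollar and with_data are ours):

```r
# Columns referenced directly with $, no data argument.
with_dollar <- boxplot(InsectSprays$count ~ InsectSprays$spray, plot = FALSE)

# Bare column names plus the data argument.
with_data <- boxplot(count ~ spray, data = InsectSprays, plot = FALSE)

print(identical(with_dollar$stats, with_data$stats))
print(identical(with_dollar$names, with_data$names))
```

The data argument mainly keeps the formula shorter and easier to read.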

4. The final data set looks like this:

At first I wrote this code:

png("E:/R4-4.png")
data<-read.csv('E:/data1_cancer.csv',header=TRUE)
boxplot(
  KRAS~TUMOR,
  data,
  col='red',
  xlegend=c("normal","tumor")
  )
boxplot(
  NRAS~TUMOR,
  data,
  col='green',
  add=TRUE
)
boxplot(
  HRAS~TUMOR,
  data,
  col='yellow',
  add=TRUE
)
dev.off()

add=TRUE means that the new boxes are added to the first plot, but the result was the following image:

The cause is probably that no offset was set, so the boxes are drawn on top of each other. The help page has the following note:

at: numeric vector giving the locations where the boxplots should be drawn, particularly when add = TRUE; defaults to 1:n where n is the number of boxes.

So when add=TRUE is used, an offset has to be set through at. With the settings below:

png("E:/R4-4.png")
data<-read.csv('E:/data1_cancer.csv',header=TRUE)
boxplot(
  KRAS~TUMOR,
  data,
  col='red',
  at=c(1:2)-0.2
  )
boxplot(
  NRAS~TUMOR,
  data,
  col='green',
  add=TRUE,
  at=c(1:2)
)
boxplot(
  HRAS~TUMOR,
  data,
  col='yellow',
  add=TRUE,
  at=c(1:2)+0.2
)
dev.off()

we get the following image:

That looks a little better than before, but the boxes are rather wide and need to be made narrower. The help page has the following note:

boxwex: a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower.

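Another problem with the image above is that the group labels on the axis are all 0 or 1; we need to set the group names. The documentation says: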

names: group labels which will be printed under each boxplot. Can be a character vector or an expression (see plotmath).

The final adjusted code is:

png("E:/R4-4.png")
data<-read.csv('E:/data1_cancer.csv',header=TRUE)
boxplot(
  KRAS~TUMOR,
  data,
  col='red',
  boxwex=0.15,
  at=c(1:2)-0.2,
  ylim=c(0,2000),
  axes=FALSE
  )
boxplot(
  NRAS~TUMOR,
  data,
  col='green',
  names = c("Normal","Tumor"),
  add=TRUE,
  boxwex=0.15,
  at=c(1:2)
)
boxplot(
  HRAS~TUMOR,
  data,
  col='yellow',
  add=TRUE,
  boxwex=0.15,
  at=c(1:2)+0.2,
  axes=FALSE
)
dev.off()

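We set axes to FALSE for the first and third groups, so only the Normal and Tumor labels from the middle group are drawn.

We also set ylim in the first call. Without ylim, the first boxplot() call fixes the Y range of the plot from the first group's data alone; because the third group contains a value close to 2000, ylim=c(0,2000) is needed. As learned earlier, the maximum could later be determined automatically by the program.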

The resulting image is:
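Returning to the remark about finding the maximum automatically, here is one possible sketch. Everything in it is an assumption for illustration: the demo data frame is simulated (it is not data1_cancer.csv), the output goes to a PDF in tempdir(), and a loop replaces the three separate boxplot() calls.

```r
# Simulated stand-in for the assignment data; the values are made up.
set.seed(1)
demo <- data.frame(
  TUMOR = rep(c(0, 1), each = 20),
  KRAS  = rexp(40, rate = 1 / 200),
  NRAS  = rexp(40, rate = 1 / 300),
  HRAS  = rexp(40, rate = 1 / 400)
)
genes   <- c("KRAS", "NRAS", "HRAS")
cols    <- c("red", "green", "yellow")
offsets <- c(-0.2, 0, 0.2)

# Common Y range over all three columns, so that no box is clipped.
ylim <- range(demo[genes], na.rm = TRUE)

out_file <- file.path(tempdir(), "R4-4-auto.pdf")
pdf(out_file)
for (i in seq_along(genes)) {
  boxplot(
    reformulate("TUMOR", response = genes[i]),   # e.g. KRAS ~ TUMOR
    data = demo, col = cols[i], boxwex = 0.15,
    at = 1:2 + offsets[i], ylim = ylim,
    add = i > 1, axes = FALSE, xlab = "", ylab = ""
  )
}
# Draw the axes once, with readable group names.
axis(1, at = 1:2, labels = c("Normal", "Tumor"))
axis(2)
box()
dev.off()
```

Drawing the axes once after the loop is an alternative to switching axes off in only some of the calls.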