Scatterplot examples

Linear Relationships

All three of these example are indications that linear regression is a reasonable to way to summarize the relationship between \(X\) and \(Y\).

Mostly linear

This happens when we have a moderately high to strong correlation.

#| standalone: true
#| viewerHeight: 500
library(shiny)
Nmax <- 1000
X <- rnorm(Nmax)
Err <- rnorm(Nmax)
ui <- fluidPage(
inputPanel(
  selectInput("N", label = "Sample Size:",
              choices = c(25, 50, 100, 250, 500, 1000), selected = 100),
  sliderInput("rho", label = "Correlation Coefficient:",
              min = .75, max = 1, value = .85, step = 0.05),
  checkboxInput("sign","Negative Correlation",FALSE)
),
mainPanel(
  plotOutput("plots")))

server <- function (input,output) {
  output$plots <- renderPlot({
  N <- input$N
  rho <- input$rho*ifelse(input$sign,-1,1)
  Y <-  rho*X[1:N] + sqrt(1-rho*rho)*Err[1:N]
  plot(X[1:N],Y,main=paste("Correlation =",rho),xlab="X")
  abline(a=0,b=rho,col="red")
},width=288,height=288)
}
shinyApp(ui=ui,server=server)

Blobby Elipse

As the correlation coefficient gets lower, the scatterplot looks more blobby, but you can still tell that there is a slope. This is a weak to moderate correlation.

#| standalone: true
#| viewerHeight: 500
Nmax <- 1000
X <- rnorm(Nmax)
Err <- rnorm(Nmax)
library(shiny)
ui1 <- fluidPage(
inputPanel(
  selectInput("N1", label = "Sample Size:",
              choices = c(25, 50, 100, 250, 500, 1000), selected = 100),
  sliderInput("rho1", label = "Correlation Coefficient:",
              min = .25, max = .75, value = .5, step = 0.05),
  checkboxInput("sign1","Negative Correlation",FALSE)
),
mainPanel(
  plotOutput("plot1")))

server1 <- function (input,output) {
  output$plot1 <- renderPlot({
  rho <<- input$rho1*ifelse(input$sign1,-1,1)
  Y <<-  rho*X[1:input$N1] + sqrt(1-rho*rho)*Err[1:input$N1]
  plot(X[1:input$N1],Y,xlab="X",main=paste("Correlation =",rho))
  abline(a=0,b=rho,col="red")
},width=288,height=288)
}
shinyApp(ui=ui1,server=server1)

No Relationship

Not much is going on here. One thing that confuses people is the idea that linear regression doesn’t work here. Actually, it gives a quite accurate picture: it tells you that not much is going on, which is what is actually happening. The prediction from the regression will be that \(\bar Y\) is the best predicted value for \(Y\).

#| standalone: true
#| viewerHeight: 500
library(shiny)
Nmax <- 1000
X <- rnorm(Nmax)
Err <- rnorm(Nmax)
ui2 <- fluidPage(
inputPanel(
  selectInput("N2", label = "Sample Size:",
              choices = c(25, 50, 100, 250, 500, 1000), selected = 100),
  sliderInput("rho0", label = "Correlation Coefficient:",
              min = -.25, max = .25, value = .0, step = 0.05),
  checkboxInput("sign0","Negative Correlation",FALSE)
),
mainPanel(
  plotOutput("plot2")))

server2 <- function (input,output) {
  output$plot2 <- renderPlot({
  rho <- input$rho0*ifelse(input$sign0,-1,1)
  Y <-  rho*X[1:input$N2] + sqrt(1-rho*rho)*Err[1:input$N2]
  plot(X[1:input$N2],Y,xlab="X",main=paste("Correlation =",rho))
  abline(a=0,b=rho,col="red")
},width=288,height=288)
}
shinyApp(ui=ui2,server=server2)

Signs that the linear model doesn’t work.

The challenge to using regression (and correlation) to summarize the relationship between \(X\) and \(Y\) is when the relationship is non-linear. Here the correlation/regression will tell about the linear part of the relationship, but missing the non-linear part. If the non-linear part is small, this might not be too bad. But if it is big, then linear regression could be misleading. (There are various types of non-linear regression that are covered in more advanced classes).

Curve

A curved relationship doesn’t look like a line.

Consider a quadradic relationship: \[ Y = b_2 X^2 + b_1 X + b_0 + \epsilon\] This is a multiple (or quadradic) regression. You can adjust the coefficients in the plot below.

#| standalone: true
#| viewerHeight: 500
library(shiny)
Nmax <- 1000
X <- rnorm(Nmax)
Err <- rnorm(Nmax)
ui3 <- fluidPage(
inputPanel(
  selectInput("N3", label = "Sample Size:",
              choices = c(25, 50, 100, 250, 500, 1000), selected = 100),
  sliderInput("b2", label = "Quadradic Term Slope:",
              min = -1, max = 1, value = .5, step = 0.05),
   sliderInput("b1", label = "Linear Term Slope:",
              min = -1, max = 1, value = 0, step = 0.05),
   sliderInput("b0", label = "Intercept:",
              min = -1, max = 1, value = 0, step = 0.05),
   sliderInput("tau", label = "Error Standard Deviation:",
              min = 0, max = 1, value = .5, step = 0.05)
),
mainPanel(
  plotOutput("plot3")))

server3 <- function (input,output) {
  output$plot3 <- renderPlot({
  Y <-  input$b2*X*X + input$b1*X + input$b0 + input$tau*Err
  rho <- cor(X,Y)
  plot(X[1:input$N3],Y[1:input$N3],xlab="X",main=paste("Correlation =",rho))
  abline(a=input$b0,b=rho,col="red")
  lines(lowess(X,Y),col="blue",lty=2)
},width=288,height=288)
}
shinyApp(ui=ui3,server=server3)

If we try to run a linear regression when the relationship is curved, it will only tell us part of the story. The story it will tell is the red line, and not the blue curve.

Broken Lines

Sometimes the reltionship changes somewhere through the range of the data. Often this is a ceiling effect: the effect of \(X\) on \(Y\) hits a ceiling. For example, in the first couple of years of teaching, the ability of new teachers rises very rapidly as they gain experience. But after 3–5 years, the effect levels out and the teachers grow much more slowly.

Ideally we would fit two linear regression to these data splitting at a certain value of \(X\), \(x_0\). So,

\[ Y = \begin{cases} b_{11} X + b_{01} + \epsilon & \text {when} X \leq x_0 \\ b_{12} X + b_{02} + \epsilon & \text {when} X \ge x_0 \end{cases} \]

#| standalone: true
#| viewerHeight: 500
library(shiny)
Nmax <- 1000
X <- rnorm(Nmax)
Err <- rnorm(Nmax)
ui4 <- fluidPage(
inputPanel(
selectInput("N4", label = "Sample Size:",
              choices = c(25, 50, 100, 250, 500, 1000), selected = 100),
  sliderInput("b11", label = "First Slope:",
              min = -1, max = 1, value = .5, step = 0.05),
   sliderInput("b12", label = "Second Slope:",
              min = -1, max = 1, value = 0, step = 0.05),
   sliderInput("x0", label = "Crossover Point (x[0])",
              min = -1, max = 1, value = 0, step = 0.05),
   sliderInput("tau1", label = "Error Standard Deviation:",
              min = 0, max = 1, value = .5, step = 0.05)
),
mainPanel(
  plotOutput("plot4")))

server4 <- function (input,output) {
  output$plot4 <- renderPlot({
  b11 <<- input$b11
  b12 <<- input$b12
  x0 <<- input$x0
  b02 <<- (b11-b12)*x0 
  Y <<- ifelse(X<x0, b11*X, b12*X + b02) + input$tau1*Err
  rho <<- cor(X,Y)
  plot(X[1:input$N4],Y[1:input$N4],xlab="X",main=paste("Correlation =",rho))
  abline(a=input$b0,b=rho,col="red")
  abline(b=b11,a=0,col="blue",lty=2)
  abline(b=b12,a=b02,col="blue",lty=2)
},width=288,height=288)
}
shinyApp(ui=ui4,server=server4)

Check out this page to practice identifying these.