Okay, in this video we're going to work through an example of regression using fabricated data. I produce fabricated data because it's something I like to do when we're just demonstrating an analysis: I want to be able to tell you in advance that the data will meet the assumptions of our analysis. So when we go through the process of testing those assumptions, I can give you a sense of what the residual plots can look like even for quote-unquote "perfect" data. The biology I've imagined for these data is that we want to look at the effect of axon thickness on signal velocity.

There are three main points I want you to take from this video. First, I'm hoping you'll notice the very strong similarity between this analysis, where we're considering a covariate, and previous analyses where we considered a single factor. Second, I'm going to introduce a concept called R-squared, which is one of the bits of output we get from the summary of our analysis. Third, I'm going to show you how we can obtain an equation for the line that we fit to our data.

With all that in mind, let's go to R and start looking at our data. The data are in a CSV file called axon.csv, so we'll read those data in now. Let's look at the structure of the data just to get ourselves oriented. We can see we have 40 observations of two variables: one variable, called thickness, is numeric, and the other, called speed, is also numeric. So we have two continuous variables. Next, let's look at the top of the data frame with head(axon). Here we go: this is how our data are set up. For an axon with a thickness of 5.13, the speed of the signal was measured as the value beside it. So we have data where each measured speed is associated with an axon of a particular thickness. Those are the data we're interested in. Finally, let's look at the whole dataset, because it feels a bit strange to run an analysis without ever looking at the full dataset. I'm not looking for anything in particular; I just want to make this a little more real for you. Okay, so those are the data.
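For reference, here's a minimal sketch of those first steps in R, assuming the file is named axon.csv and sits in the working directory:

```r
# read in the fabricated data and get oriented
axon <- read.csv("axon.csv")

str(axon)   # 40 observations of two numeric variables: thickness and speed
head(axon)  # the top of the data frame
axon        # the whole dataset, just to make it a little more real
```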
Let's start by plotting the data. We plot the data in exactly the same way as we've done for single-factor general linear models. There is one bit of thought, though, that we need to pay attention to. Previously, with one-factor general linear models, we could be a bit lazy when deciding what our Y variable would be: we could just look for whichever column of our dataset had numbers in it, as opposed to categories, and choose that column as our Y variable. I'm not saying that's what anyone would have done, but we could have. Instead, I've been encouraging everyone to think about our plots and our models in terms of our hypothesis, and to decide on that basis what our dependent variable is. That same lesson holds true here. Our hypothesis is that the thickness of an axon will affect the speed of the signal that the axon delivers. We're not saying that we expect thickness to depend on speed; we're not hypothesizing that speed will influence thickness. As a result, we're going to say that speed is the dependent variable and thickness is our independent variable. How we decide which of these continuously distributed traits to assign as independent and which as dependent matters, because a regression essentially assumes that the independent variable determines the dependent variable. If we have it the wrong way around, the inferences we make will not necessarily apply. The last thing we do is name our data frame. So let's make this plot.

One other thing we want to do: you can see that the y-axis does not go all the way to 0, so let's set ylim to a vector from 0 to 120, because 120 is about our highest point. And I'm going to go one step further and set the range of the x-axis as well: xlim from 0 to 20. The reason I've done that is that when we inspect the data, I want to be able to make a rough prediction about the slope and the intercept, and it's easier to imagine where the intercept would be if we can actually see the zero-zero point.

So what do we see here? First of all, there's a pretty obvious relationship between thickness and speed: as thickness increases, so does speed. We can also see that there are no outliers. By outliers I don't mean data points that are wrong; I just mean data points that are surprising or unusual. We don't suddenly have a single data point sitting way off on its own while all the rest lie right along this nice line. So: no outliers, obvious trend. What else can we learn from this plot? Well, we can try to guess what the slope will be. Over this range, the y values increase from 0 to about 120 while the x values increase from 0 to 20. So if we calculate rise over run, we get about 120 over 20, which equals 6. Based on this, we can predict that our slope will be approximately 6. We can guess the intercept as well: if we imagine where the line would cross the y-axis, it looks like it will be pretty close to 0. That makes sense, because if an axon has a thickness of 0, then presumably it cannot carry a signal. Those are our observations and our predictions; let's carry them forward into our analysis.

Now let's model our data. What I want you to see here is that the model is formulated in exactly the same way as when we were working with single factors. We're using the function lm. I just copied the code from the plot, because I thought it would speed things up a bit, and it's exactly what we want: speed as our dependent variable, thickness as our independent variable, still referring to our dataset. We're going to call the output axon.lm; that's where we'll save our output. Let's run this.
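As a sketch, the plotting and modelling calls described above look something like this (the axis limits are the ones we just chose):

```r
# plot speed (dependent) against thickness (independent),
# with both axes anchored at zero so the intercept is easy to imagine
plot(speed ~ thickness, data = axon,
     xlim = c(0, 20), ylim = c(0, 120))

# fit the regression using the same formula syntax
# we used for single-factor general linear models
axon.lm <- lm(speed ~ thickness, data = axon)
```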
Now we want to inspect the residuals and check our assumptions. Remember our first two assumptions: that the data are randomly sampled and that they are independent. I can tell you that they are, because I generated the data in a way that makes sure of it. Our other assumptions are that we have equal variance and that the residuals are normally distributed. With equal variance, we're no longer comparing variance among different groups, because we no longer have groups. Instead, we are looking for equal variance along the range of our x-axis. In other words, we're checking that the variation in our residuals is consistent as you move from a small value on the x-axis to a large value. I'll show you what I mean.

To check our assumptions, we do exactly as we've done before: we say plot, and then give the name of the object that contains the output from the lm function. Here we go. Here is our first plot, which allows us to test the assumption of equal variance. You can see that we no longer have data points that line up nicely in columns, and that's because we no longer have factors. What we're plotting here is the value of the residual for each of our data points as we move along the line that we fit through the data; that fitted line represents the fitted values. If we have a residual that lies above the 0 line, that means we have a data point that lies above the fitted line. Here we have two points that lie below the fitted line, and we know they're below because they have negative residuals. What we're looking for is whether the variation among these residuals, basically how spread out they are vertically, is consistent as we move from left to right. To my eye, this looks absolutely fine. We have maybe one or two data points sticking out a little farther than we might expect, but that's totally fine; they're not sticking out to a ridiculous degree. So I would say the amount of vertical variation in our residuals is pretty well constant as you move along the x-axis. This tells us that our data meet the assumption of equal variance; we have homogeneous variance. We should expect that, because I generated these data with a random number generator in a way that makes it true.

Next, we test the assumption of normality, and the Q-Q plot looks very normal to me: all of our points fall really close to the dotted line. So we're very happy that our data meet the assumption of normality.

Now that we know our data meet the assumptions of equal variance and normality, let's look at our results. We can look at them in the same way we did for a one-factor general linear model, where we generate an overall p-value by using the anova function. Here we go. You can see we have this very small p-value, which tells us that the slope of the line we fit through the data is significantly different from 0. In a regression, the null hypothesis that this p-value corresponds to is that the slope of the line fit through our data equals 0.
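These are the same two steps we used for one-factor models; a minimal sketch:

```r
# diagnostic plots for checking assumptions
plot(axon.lm, which = 1)  # residuals vs fitted values: check equal variance
plot(axon.lm, which = 2)  # normal Q-Q plot: check normality of residuals

# overall ANOVA table: the p-value tests the null hypothesis
# that the slope of the fitted line equals 0
anova(axon.lm)
```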
If you get a p-value that is very low, it's telling us that we have good reason to believe the slope is not actually equal to 0; it's something different from 0. It might be greater than 0, it might be less than 0. We don't know what that slope is yet; we'll get to that in a moment. So this p-value, which is very small (less than 2.2 times ten to the negative 16), indicates that we have good reason to believe our slope is not equal to 0. The other thing I wanted you to see is that even though we're running a model with a covariate, the ANOVA table is formulated in exactly the same way as for a one-factor general linear model. We get output for our sums of squares, our mean squares, and our F value. And if you take this mean square and divide it by that mean square (in fact, let's do that, just to be complete), you will see that the result is the F value we get here. It won't be exactly the same, because there are additional digits in the mean squares that are not being displayed but that will have been used to calculate the F value.

So we have strong reason to believe that our slope is different from 0. Let's figure out what that slope is. To do that, we look at the summary of the output from the lm model: we say summary and then axon.lm. Here we go. I want you to focus on what's called the adjusted R-squared. We also have a multiple R-squared; we're not going to talk about the difference between them, so let's focus on the adjusted R-squared. What R-squared in general tells us is the proportion of the variation in our Y variable that is explained by our X variable. That's what these R-squared values are referring to. Here, the adjusted R-squared and the multiple R-squared are very similar to one another, and both tell us that about 97% of the variation in speed can be explained by variation in thickness. That's generally what R-squared refers to.
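As a sketch of both steps, dividing the mean squares by hand and pulling up the summary ("Mean Sq" is the column label R uses in the ANOVA table):

```r
# reproduce the F value by hand:
# mean square for thickness / mean square for residuals
aov.tab <- anova(axon.lm)
aov.tab[["Mean Sq"]][1] / aov.tab[["Mean Sq"]][2]

# full summary: coefficient estimates, standard errors, and R-squared values
summary(axon.lm)
```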
Now let's interpret this output. The interpretation here is different from the interpretation of the output we got when we looked at the summary of a one-factor general linear model. In previous videos, when we called summary on a model with a factor that had multiple levels, the output looked like this but had a particular meaning: the intercept referred to the mean value of some reference treatment or reference level chosen by R, and the terms underneath it represented the differences between the other levels of the factor and that reference level. In a regression, the intercept and thickness terms can be interpreted in a slightly more intuitive manner. The intercept refers to the y-intercept: it gives us the value of y that we expect when the value of our covariate is equal to 0. We didn't actually have any values of the covariate equal to 0, so this estimate of the y-intercept is extrapolated from the line that has been fitted through the data. And because it's been extrapolated, we should take this estimate with a pinch of salt. But that's what it means: it's the estimated y-intercept of the line that has been fit through our data. This term here, the estimate for thickness, which equals 5.97, refers to the slope of our line. You'll remember that when we looked at our original plot, we estimated the slope as being near 6, and sure enough, the estimated slope is 5.97.

I've taken a breath for dramatic pause, and I'll say that this point actually marks the end of our analysis. In previous analyses we would go on to calculate an effect size, but with a regression like this, the output already gives us our effect size. The effect size we're most interested in is the slope, which is given right here, along with a standard error as a measure of our uncertainty in that slope. We also have a standard error for the y-intercept. In some analyses we might be really interested in what the y-intercept is and whether or not it's significantly different from 0, in which case the estimate and the standard error for the intercept would both be of great interest.

One thing I should mention is what these p-values refer to. The p-value for thickness has the same interpretation as the p-value we got from the anova function: it's a test of whether the slope of the line we fit to the data is different from 0, because our null hypothesis is that the slope equals 0. The p-value for the intercept refers to a test of whether the intercept is significantly different from 0. You can see that this p-value is much greater than 0.05, so based on it we have no reason to think that the estimated y-intercept is different from 0. I feel like I've said that not quite right: this estimate is our best estimate of the y-intercept; however, if we take the standard error into account and consider the variation around this estimate, we can see that 0 is also a plausible value for the intercept. That's the proper way of saying it. So those are our results.
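If you'd rather pull those effect sizes and standard errors out of the model programmatically than read them off the printout, a small sketch:

```r
# the full coefficient table: estimates, standard errors, t values, p-values
coef(summary(axon.lm))

# just the estimates and standard errors for the intercept and slope
coef(summary(axon.lm))[, c("Estimate", "Std. Error")]
```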
There's one last thing I'd like to show you. I created this code ahead of time, because it's really nice to be able to create a nice plot of your data, and I'm just going to show you what this plot looks like. I won't go through it in great detail, but I'll quickly point out each of the options we specified. The first part, where we say speed as a function of thickness with data equal to axon using the plot command, is exactly what we had earlier. What I've done is add a number of options. First, I set xlab, the label for the x-axis, to "Fiber diameter (micrometers)". I could have labelled the x-axis using the mtext command, which is what I showed in previous videos, but in this case I decided to do it this way. Similarly, I gave a label for the y-axis with the ylab option: "Conduction velocity (m/s)". I also want to point out the cex.lab option, which I believe corresponds to the font size of the axis labels. Let's play around with that to demonstrate: if we make it 3 instead of 1.5, we should see, if I remember right, the words "fiber diameter" and "conduction velocity" get much bigger, and certainly they do. The cex.axis option controls the font size of your axis numbers, so let's make this look ridiculous and change it to 0.5, just to illustrate the effect, and you can see the effect there. Finally, I've got the options where I say xlim equals this and ylim equals that; this is where I've specified the x and y axes. There's an argument to be made that you'd want to start both of these at 0. But when we're showing relationships like this, I don't think it's as important to show the origin, the point where x is 0 and y is 0, because doing that is not really going to change our perspective on the strength of the effect. When we were plotting data for analyses of factors, we wanted to make sure we scaled our y-axis appropriately so that we didn't exaggerate the effect of our treatments. I don't think we need to worry about that in the same way when presenting results from a regression. So I'm showing that you could change the ranges to start at 0 for both the x and y axes, which is what I'm doing here, but I don't think that's necessary; it isn't particularly helpful.

The last thing I want to show you is the function abline, which draws a nice line through your data: the first thing you specify is the y-intercept, and then you specify the slope. You have to call abline immediately after plotting the data, so that R knows to add the line to the output of the plot function, and it will draw a line with that y-intercept and that slope. Finally, we can say lwd, which refers to the line width; I've made the line a little thicker than standard by saying 1.5, where the standard would be 1. Okay, and that's what I wanted to show you.
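Put together, the polished plot described above looks roughly like this (the intercept and slope passed to abline are the fitted estimates from our summary; passing the model object itself, abline(axon.lm, lwd = 1.5), would draw the same line):

```r
plot(speed ~ thickness, data = axon,
     xlab = "Fiber diameter (micrometers)",
     ylab = "Conduction velocity (m/s)",
     cex.lab = 1.5,   # font size of the axis labels
     cex.axis = 1.0,  # font size of the axis numbers
     xlim = c(0, 20), ylim = c(0, 120))

# add the fitted line: first the y-intercept, then the slope,
# drawn slightly thicker than the standard line width of 1
abline(a = 0.52, b = 5.97, lwd = 1.5)
```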
There's one last bit of business to attend to, and that is the question of how you report the results of a regression. First, as always, you want to present a nice figure, as I've done here. Then you could present your results with a statement along the lines of: conduction velocity increases significantly with fiber diameter. You note that we've run a regression, and you give the degrees of freedom, which in this case is 1 for our covariate (covariates always have one degree of freedom), with our residual degrees of freedom being equal to 38. Then we report our t value and our p-value; my mind just blanked there for a moment, but I believe I got those from the output of the summary function. Alternatively, we could report the F value that we got from the output of the anova function. The other thing we want to present is the information that gives us the equation for our line: the slope plus or minus its standard error, given as 5.97 plus or minus 0.160, and our estimate of the intercept, which is equal to 0.52 plus or minus a standard error of 2.

And we'll stop the video there. This has been a quick tour of how to conduct a regression analysis, including how to check your assumptions, how to plot the data, and how to make a decent-looking plot. One thing I did not show in this video is how to add error bars around the line we fitted through our data; that's something I'll show in a future video. We've also discussed how to interpret the output, both in terms of assessing our p-value to test the null hypothesis and in terms of getting our effect sizes, which are essentially our slope and our intercept, along with standard errors for each of those. I hope this video has been helpful, and I'll stop there and say thank you very much.