My contribution so far to this thread has been the bubble/scatter chart (http://ideapad.org.uk/drwong/drwong_correl.html) providing a visual take on the FFT vs staff survey results.

I had a couple of hours spare so I’ve created a survey search/graphing tool bringing together GMC trainee satisfaction results (thank you Dr Wong for the direction here), FFT and NHS staff survey results. http://ideapad.org.uk/drwong/search/

My thoughts on using NHS data more effectively centre mainly on the formatting of the data. Many of the data releases still use Excel files or clunky table generators, and near-zero effort has gone into improving accessibility for people interested in understanding the inner workings of the NHS.

To create the search/graphing tool I had to spend the majority of the time cleaning the data up before importing it into a database to merge the survey results together.

I wish the NHS would involve geeks and lay people in the process of releasing data to the public.

(cross posted with carlplant.com)

---

Interesting stuff.

A few observations:

**Validity of FFT**

- FFT/net promoter score is a poor measure – technically (it performs badly) and theoretically (it doesn’t capture the thing people think it captures) – see Graham & MacCormick 2012, p11. I don’t think it should be used.
- The correlation of 2 poor measures (or a poor measure with a well-validated measure) doesn’t provide useful information. It may be better to point out that one of the measures is crap and not worry too much about what a crap measure is telling you. At least that way you might get people to focus on developing a better measure. (There’s some solid science behind developing and validating survey questions that seems to have been ignored by NHS England in developing the FFT).

**Some statistics things**

- If NPS weren’t a crap measure, you would still not know that any relationship between it and the staff survey wasn’t confounded by something else – like the severity of illness of the patients. We know that patients receiving treatment during times of serious illness are more likely to report positive experiences of care (happy to supply refs if helpful). I also suspect that staff who work at quaternary referral hospitals may be happier to work where they work by virtue of the fact that they probably sought out those specific employing hospitals. It’s thus interesting to see that it’s specialised trusts (Papworth, Marsden, National Hospital for Rheumatological Disease, etc.) that generally end up in your upper right quadrant. You could propose a load of other things that might influence the results of both measures on your graph but that aren’t acknowledged in the data (case mix of patients, regional variations in grumpiness, and so on). To address this you would need multivariable regression to control for potential confounders – and even then you’d still run the risk of confounding by an unobserved variable.
- Cross-sectional data, even properly analysed, will only get you to correlation, not causation. You may be fine with this, but time series analysis (looking at the relationship over time) might improve your ability to say something useful about the relationship.
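The confounding point above can be sketched in R with simulated data (the variable names and effect sizes here are invented purely for illustration): a "severity" variable drives both scores, a crude regression finds a spurious association, and adjusting for the confounder removes it.

```r
# Simulated confounding: severity drives both the staff measure and the
# patient-experience measure; there is no direct staff -> patient effect.
set.seed(42)
n <- 500
severity <- rnorm(n)                        # unobserved case-mix severity
staff    <- 0.5 * severity + rnorm(n)       # fake staff survey score
patient  <- 0.5 * severity + rnorm(n)       # fake patient experience score

crude    <- lm(patient ~ staff)             # confounded estimate
adjusted <- lm(patient ~ staff + severity)  # controls for the confounder

coef(crude)["staff"]     # spuriously away from zero
coef(adjusted)["staff"]  # shrinks towards zero once severity is controlled for
```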

In summary – great work on the blog. Brilliant to see you converting your curiosity into stimulating material. Problem is FFT is a crap measure – GIGO. If you still wanted to use it though, you’d want to use something a bit more sophisticated (but not at all hard) like multivariable regression +/- time series analysis.

---

Since you have posted about open source, why not use an open-source stats program like R, and post the data in a safe, open format, rather than the Excel file with macros that you have?

You can get R from http://www.r-project.org, and the fantastic R IDE, RStudio, from http://www.rstudio.com, both of which are fully FOSS. If the base packages are not up to the job, you can easily write your own code, or go to CRAN, http://cran.r-project.org, where a multitude of packages will solve almost any stats problem for you.

OK, now you’re open source – so what?

If you paste the code below into an R source file and run it, it will show you some of the things that are relevant to your problem:

```r
x <- rnorm(100, 50, 15)    # generate 100 normally distributed random numbers with mean 50 and sd 15
y <- x + rnorm(100, 0, 15) # generate another 100 by adding noise with mean 0 and sd 15 to the first lot

fit1 <- lm(y ~ x) # fit y on x with a linear model
fit2 <- lm(x ~ y) # fit x on y

print(summary(fit1)) # look at some numbers calculated from the models
print(summary(fit2))

plot(x, y)                                           # plot a scatter diagram of the points
curve(coef(fit1)[1] + coef(fit1)[2] * x, add = TRUE) # add the regression line for y ~ x

plot(x, resid(fit1)) # plot the residuals against x
abline(0, 0)         # horizontal reference line at zero

# NOT RUN:
# lm  # gives you the code that actually runs when you use the lm() function
```

Then you can get to work on importing your file into R and doing the same.
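A minimal import sketch – the file contents and column names here are invented so the example runs on its own; in practice you would point `read.csv()` at your own exported survey file:

```r
# Write a tiny example CSV so this sketch is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("trust,staff_score,fft_score",
             "A,3.6,72",
             "B,3.9,78",
             "C,3.4,70"), tmp)

dat <- read.csv(tmp, stringsAsFactors = FALSE)
str(dat)                                       # check column names and types
fit <- lm(fft_score ~ staff_score, data = dat) # same recipe as above
summary(fit)
```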

You'll see that it will calculate R^2, the F-statistic and a p value for you in a single step, as well as the regression coefficients and p values for them. And if you want to know how it does it you can easily find out, because it's open source.
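If you want those numbers programmatically rather than reading them off the printout, they all live in the summary object (a self-contained sketch using the same fake-data recipe as above):

```r
set.seed(1)
x <- rnorm(100, 50, 15)
y <- x + rnorm(100, 0, 15)
s <- summary(lm(y ~ x))

s$r.squared   # R^2
s$fstatistic  # F statistic with its degrees of freedom
coef(s)       # estimates, standard errors, t statistics and p values
# the overall p value, recomputed from the F statistic:
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)
```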

But just because you can, doesn't mean you should. For example, the two versions of the graph presented have the x and y axes interchanged. This matters, because the regression line is not the same for both versions, as you can see from my fake data version above. It also fundamentally affects how you interpret the results. There is evidence of correlation between storks and births (doi:10.1111/j.1365-3016.2003.00534.x). But what is your theory? There are at least three possible theories, which in cartoon form are:

- Grumpy staff give poor service, so that the patients rate the hospital lower,
- Grumpy patients make the staff dislike their job,
- Staff and patients live in the same area and living in certain areas makes everyone grumpy.

You also need some idea of whether linear regression is a reasonable thing to do. I've used fake normally distributed data, which gives randomly distributed residuals, as can be seen from the residual plot, so here it is. But the real values cannot go above 100 or below 0, so a linear model can misbehave near those boundaries.

Your point about outliers is important, because these can have a large effect on the results of linear regression. Detecting them and what to do about them is a whole other problem though.
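One standard starting point for detection (a common diagnostic, not a complete answer) is Cook's distance, which base R computes for any `lm` fit; a frequent rule of thumb is to flag points above 4/n. A sketch with a deliberately planted outlier:

```r
set.seed(2)
x <- rnorm(50, 50, 15)
y <- x + rnorm(50, 0, 15)
y[50] <- y[50] + 100      # plant an artificial outlier

fit <- lm(y ~ x)
d <- cooks.distance(fit)
which(d > 4 / length(d))  # the planted point should show up here
plot(fit, which = 4)      # base R's built-in Cook's distance plot
```

What you then do about a flagged point (drop it, investigate it, use a robust method) is a judgement call, not a mechanical one.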

Let me know how you get on.
