Developing our evaluation process to support a higher standard of evidence

  • Over the last few months, we’ve been sharing insights on the mixed methods approach we take to monitoring and evaluation 
  • Quantitative evidence is an important part of this approach. We’re working with external partners to try to develop a higher standard of evidence for our programme
  • In this blog, we set out the purpose and approach of this quantitative evaluation, and get to grips with the challenges that come with defining a meaningful comparison group

We believe quantitative evidence is a crucial element of understanding impact. As an organisation, we already have encouraging evidence of impact for children on the programme when we compare their outcomes before and after their time with us; however, this doesn’t tell us how much these children might have improved (or not) without our support. For this reason, in 2020 we commissioned a major external evaluation with the UCL Centre for Education Policy and Equalising Opportunities (CEPEO) and the Helen Hamlyn Centre for Pedagogy (we refer to the team as ‘UCL’ hereafter) to work with us to develop a higher standard of evidence than our current ‘pre-/post-intervention’ method.

There are several significant challenges in trying to develop a more robust evaluation design. The most substantial is defining a meaningful comparison group for children on the programme – i.e. a group of children that we are not supporting, whose progress we can monitor as a point of comparison with the children who are on the programme. We spend a lot of time working with schools to identify the right children to enrol on the programme, so they are very much not a random cohort. We need a methodology that compares their progress with that of children who have similar starting points and background characteristics.

Scoping out a control group

Randomised controlled trials (RCTs) are sometimes used to overcome this challenge. In an RCT, you randomly select which children are allocated to the programme, and which children will form the ‘control’ group. WLZ could ‘randomise’ within a school - i.e. identify a cohort of 40 ‘WLZ-eligible’ children, and then enrol 20 at random. However, this would effectively halve the size of the programme in each school. As Link Workers have a cohort of 40 children, each of them would then need to work across two schools. This is a relatively inefficient method of delivery: our Link Workers would be less ‘present’ for children in each school, and they would also have to build twice as many school/staff relationships.
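To make the mechanics concrete, here is a minimal sketch of within-school randomisation. The cohort size of 40 matches the figures above; the child identifiers and fixed seed are purely illustrative, not part of WLZ’s actual process.

```python
# Illustrative within-school randomisation for an RCT.
# Child identifiers and the fixed seed are hypothetical.
import random

eligible = [f"child_{i:02d}" for i in range(1, 41)]  # 40 'WLZ-eligible' children

rng = random.Random(2020)              # fixed seed so the split is reproducible
programme = rng.sample(eligible, 20)   # 20 children enrolled at random
control = [c for c in eligible if c not in programme]

print(len(programme), len(control))    # 20 20
```

In a real trial the assignment would be pre-registered, and possibly stratified by characteristics such as year group, rather than drawn from a single pool like this.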

Another approach would be to randomise at the school level: select a cohort of 50 schools, and then work with only 25 of them (again, selected at random). This was not feasible as:

  1. We did not believe there were sufficient schools we were not already working with in our existing ‘Zone’ in West London to create a suitable pool of schools to select from.
  2. We did not believe that many/any schools who were selected for the ‘control’ group would dedicate sufficient time to the cohort identification and ongoing monitoring of children’s progress that would be required to create an effective control group.

When tendering for the project, UCL proposed a propensity score matching methodology to mitigate this challenge. This methodology quantifies the ‘level’ of similarity between children on the WLZ programme, and non-WLZ children in existing partner schools, based on their background characteristics (e.g. prior attainment, Pupil Premium eligibility). It uses this propensity ‘score’ to match WLZ children to non-WLZ children. The aim is that the comparison group is sufficiently similar to provide a meaningful point of comparison for children on the WLZ programme.  
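As a rough sketch of the matching step (the propensity scores, child identifiers, and simple nearest-neighbour rule here are our illustrative assumptions; in practice UCL estimate scores from a statistical model of background characteristics):

```python
# Illustrative nearest-neighbour propensity score matching.
# All scores and identifiers are hypothetical; a real analysis would
# estimate each child's score (probability of enrolment) from a model
# of background characteristics such as prior attainment and
# Pupil Premium eligibility.

def match_nearest(treated, pool):
    """Match each treated child to the non-treated child with the
    closest propensity score, without replacement."""
    available = dict(pool)  # id -> propensity score
    matches = {}
    for child_id, score in treated.items():
        if not available:
            break
        best = min(available, key=lambda c: abs(available[c] - score))
        matches[child_id] = best
        del available[best]
    return matches

# Hypothetical propensity scores.
wlz_children = {"A": 0.82, "B": 0.75, "C": 0.91}
comparison_pool = {"X": 0.40, "Y": 0.78, "Z": 0.85, "W": 0.30}

print(match_nearest(wlz_children, comparison_pool))
# → {'A': 'Z', 'B': 'Y', 'C': 'X'}
```

Note how, once the close matches are used up, the last WLZ child (score 0.91) is paired with a very different comparator (score 0.40) – a toy version of the lack-of-overlap problem discussed later in this post.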

The approach to date 

We believed the propensity score matching approach was the most appropriate methodology to help identify a more meaningful comparison group for children on the programme. No approach will provide a ‘perfect’ comparison group, but this one is an improvement on comparing the progress of children on the WLZ programme with that of all children nationally, or with a sub-group that shares only one characteristic (e.g. eligibility for Pupil Premium). The question was whether it would provide a ‘good enough’ point of comparison to enable confident assessments of whether or not children on the programme are making accelerated progress compared to other children.

We’re now two years into this evaluation, and to date it has been hard for us to draw a conclusion. UCL has noted that our method for identifying children to enrol is very thorough and implemented consistently across schools - this means that the children we work with have a higher number of ‘risks’ (e.g. surrounding attendance, Pupil Premium eligibility, attainment) than other children within the same school. This poses challenges for the evaluation, as there are few, if any, pupils like WLZ cohort members who aren’t part of the programme. When we select the ‘best match’ for children on the programme, our best comparators are children who may have one or two ‘risks’, but who are unlikely to have the range and complexity of barriers of the typical child on the WLZ programme.

Or, as the UCL team has summarised the challenge:

“Observable (and, likely, unobservable) differences do remain between the [West London Zone] and comparison groups. This is the case even where we have taken quite an aggressive approach to identifying well-matched comparison groups at the expense of sample size – ultimately young people selected into WLZ are extremely different to their peer group and, as a result, it is extremely difficult to identify a suitable number of truly comparable individuals as a comparison group […] This is particularly the case because the comparison groups assembled generally have average characteristics that are correlated with stronger outcomes (educational and wider) than the WLZ participants to whom they will be compared.”

Next steps 

We remain optimistic that the evaluation may provide useful evidence. We are only part-way through the project and have a further two cohorts to assess - this will give us a larger sample size from which to draw conclusions. We are also considering whether other methodologies could give us an alternative comparison group, for example through regression discontinuity analysis, whereby we would compare children who were ‘just’ eligible for the programme to those who narrowly missed out on enrolment.
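To illustrate the regression discontinuity idea, here is a toy sketch. The risk-score cutoff, bandwidth, and outcome data are all hypothetical, and a real analysis would fit local regressions on either side of the cutoff rather than compare raw means:

```python
# Illustrative (naive) regression discontinuity comparison.
# Threshold, bandwidth, and all data are hypothetical.

THRESHOLD = 0.5   # eligibility cutoff on a hypothetical risk score
BANDWIDTH = 0.1   # only compare children close to the cutoff

def discontinuity_estimate(children):
    """Mean outcome of just-eligible children minus mean outcome of
    just-ineligible children, within the bandwidth of the cutoff."""
    eligible = [c["outcome"] for c in children
                if THRESHOLD <= c["risk"] < THRESHOLD + BANDWIDTH]
    ineligible = [c["outcome"] for c in children
                  if THRESHOLD - BANDWIDTH <= c["risk"] < THRESHOLD]
    return sum(eligible) / len(eligible) - sum(ineligible) / len(ineligible)

children = [
    {"risk": 0.55, "outcome": 72},  # just eligible, enrolled
    {"risk": 0.52, "outcome": 70},
    {"risk": 0.48, "outcome": 65},  # narrowly missed enrolment
    {"risk": 0.45, "outcome": 63},
]
print(discontinuity_estimate(children))  # 71.0 - 64.0 = 7.0
```

The appeal of this design is that children just either side of the cutoff should be very similar on average, so the jump in outcomes at the threshold can be read as an estimate of the programme’s effect – at the cost of only using children near the margin of eligibility.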

We all hope that evaluation will provide simple answers - i.e. ‘find what works and support us to scale it up’. However, evidence is often more complex than this. When commissioning this evaluation, we knew there would be challenges with any methodology we selected. We’ve benefited from working with an excellent evaluation team, but in practice we have still run up against a question we cannot yet answer: is this methodology precise enough to compare the progress our young people are making against that of other, similar children? We remain committed to testing our model against higher standards of evidence, and as we approach the end of this evaluation, we are considering the right way to understand and assess impact in the coming years.
