Replication
As part of a PhD course, I replicated the paper by Danzer and Lavy, “Paid Parental Leave and Children’s Schooling Outcomes”, published in the Economic Journal in 2018. The paper studies an extension of the maximum duration of parental leave in Austria and concludes that it had a large positive effect on the educational performance of sons of highly educated mothers, but a large negative effect on sons of low-educated mothers.
As I explain in more detail below, my replication uncovers a straightforward methodological error that invalidates the main results of the paper. However, getting it published proved more difficult than expected! I initially submitted my short comment to the Economic Journal: since I was pointing out a factual mistake in a published paper, it seemed only natural that the comment should appear in the same journal as the original article. Unfortunately, they rejected my comment (twice!), despite positive independent referee reports. I am very grateful that my comment was eventually accepted for publication in the Journal of Applied Econometrics, which has a dedicated Replication Section.
I am sharing my experience publicly in the hope that more and more journals become open to replications. Despite all the challenges I faced, I still encourage researchers to do replications and to share their findings. This is crucial for the progress of science: each study contributes a small piece to the general knowledge on a topic, and if that is the common aim, replication work should be just as welcome as original research.
Summary
Danzer and Lavy (2018) study how an extension of the maximum duration of parental leave in Austria affected children's educational performance, using data from PISA. They find no statistically significant effect on average, but highlight large and statistically significant heterogeneous effects that vary in sign depending on mothers' education and children's gender. According to their estimates, the policy increased the Reading scores of sons of highly educated mothers by 33% of a standard deviation (SD), with a standard error of 15%, and their Science scores by 40% SD (st. error=11%). In contrast, sons of low-educated mothers experienced a decrease of 27% SD in Reading (st. error=13%) and 23% SD in Science (st. error=13%).
When replicating their study, I realized that the authors had not followed the recommended procedure for dealing with PISA data. Like other international large-scale assessments such as TIMSS, PIAAC and PIRLS, PISA does not report a single test score per student: because individual proficiency is measured with error, it provides five plausible values for each domain, each representing a random draw from the student's posterior score distribution, and all of them need to be used in the estimation. In addition, because of the stratified sampling design, standard errors should be computed using the Balanced Repeated Replication (BRR) replicate weights. Danzer and Lavy used only one of the five available plausible values and did not take the BRR weights into account.
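For readers unfamiliar with the procedure, here is a minimal sketch of how the two pieces fit together: each statistic is estimated once per plausible value, its sampling variance is obtained from the BRR replicate weights (with Fay's adjustment), and the plausible values are then combined with Rubin's rules. The column names (PV1READ–PV5READ, W_FSTUWT, W_FSTR1–W_FSTR80) and the Fay factor of 0.5 follow the usual PISA conventions, but treat them as assumptions of this sketch; in practice, dedicated survey packages in R or Stata implement the same logic and should be preferred.

```python
import numpy as np

def brr_variance(replicate_estimates, full_sample_estimate, fay=0.5):
    """Sampling variance from BRR replicate estimates with Fay's adjustment."""
    reps = np.asarray(replicate_estimates, dtype=float)
    G = len(reps)  # typically 80 replicate weights in PISA
    return np.sum((reps - full_sample_estimate) ** 2) / (G * (1 - fay) ** 2)

def pisa_estimate(df, pv_cols, final_weight, rep_weights, statistic):
    """Combine plausible values (Rubin's rules) with BRR sampling variance.

    `statistic(df, score_col, weight_col)` returns a scalar, e.g. a weighted
    mean or a regression coefficient estimated with the given weight column.
    """
    M = len(pv_cols)
    thetas, sampling_vars = [], []
    for pv in pv_cols:
        theta = statistic(df, pv, final_weight)            # full-sample estimate
        reps = [statistic(df, pv, w) for w in rep_weights]  # one per replicate weight
        thetas.append(theta)
        sampling_vars.append(brr_variance(reps, theta))
    theta_bar = np.mean(thetas)                # final point estimate
    within = np.mean(sampling_vars)            # average sampling variance
    between = np.var(thetas, ddof=1)           # imputation variance across PVs
    total_var = within + (1 + 1 / M) * between
    return theta_bar, np.sqrt(total_var)

# Hypothetical usage, assuming a pandas DataFrame `df` with PISA-style columns:
# pv_cols = [f"PV{i}READ" for i in range(1, 6)]
# rep_weights = [f"W_FSTR{g}" for g in range(1, 81)]
# weighted_mean = lambda d, y, w: np.average(d[y], weights=d[w])
# estimate, se = pisa_estimate(df, pv_cols, "W_FSTUWT", rep_weights, weighted_mean)
```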
The main results of the paper change substantially once the analysis is performed correctly. The large and statistically significant heterogeneous effects highlighted in the paper become substantially smaller in absolute magnitude, and the associated standard errors increase, so that the estimates become statistically insignificant and largely uninformative. According to the corrected results, the educational performance of sons of highly educated mothers increased by 13% SD in Reading (st. error=23%) and by 21% SD in Science (st. error=21%), while for sons of low-educated mothers the point estimates are -21% SD in Reading (st. error=17%) and -21% SD in Science (st. error=16%). (You can read more details in the paper.)
It would be unfair to single out Danzer and Lavy for this methodological error. I conducted a survey of papers using data from international large-scale assessments published between 2000 and 2019 in top economics journals and found that 35 out of 56 papers do not follow the recommended procedure. In large samples this is unlikely to affect the results but, when the sample is relatively small, as in this paper, it can have important consequences. The problem is likely to be exacerbated in the presence of publication (or author) bias favoring large significant estimates.
EJ initially rejected my submission on the grounds that “the policy among editors is not to accept comments”. After I shared this story on Twitter in 2021, EJ invited me to appeal, with the promise that if my findings stood the comment would be published. However, the comment was eventually rejected again, apparently on the basis of a report from the original authors that the editor asked me not to share. The authors acknowledge that their original analysis was not correct, but argue that if they slice their data in a different way they can still get some stars for the subsample of mothers with the lowest educational level (10% of the sample). This analysis does not adjust for multiple testing, does not justify why the data should be sliced in this particular way, and does not provide any evidence that this subset of women, with very low labor market attachment, actually qualified for the parental leave extension.
Paradoxically, the editor also remarked that I am not the first one to point out to them that economists should use the correct procedure to analyse PISA data: a paper by Jerrim et al. (2017) that I cite had already raised this issue, using precisely an article published by Lavy in the EJ in 2015 as its example.
Furthermore, EJ decided not to retract Danzer and Lavy (2018) or publish a corrigendum. I think this is unfortunate and does a disservice to its readers, who deserve to be informed that (i) based on the evidence presented in the paper, we cannot conclude that parental leave affects children's schooling outcomes, and (ii) estimations with PISA data need to follow the appropriate procedure. The fact that the earlier comment by Jerrim et al. (2017) was ignored both by the EJ and by the authors suggests that, as a discipline, we need to do better at ensuring that errors in published papers are acknowledged and corrected.
Here is a timeline of the process:
22 December 2020 - I submit my comment to EJ. I also try to contact the original authors, but I do not receive any reply.
28 June 2021 - after more than six months without any news, I contact the editor in charge inquiring about the status of my submission.
30 June 2021 - the editor rejects the paper, having learnt that "the policy among editors is not to accept comments", despite two very positive referee reports (R1, R3).
1 July 2021 - frustrated and worried by this response, I share this story on Twitter.
The response from many other researchers and even journal editors was overwhelmingly positive and supportive, providing reassurance that the attitude of EJ does not reflect the opinion of all scholars in our discipline.
3 July 2021 - I receive an email from one of the managing editors of EJ requesting an urgent call, in which I basically get told off for making the matter public instead of following the official method of appeal, for making EJ look bad, etc. I am urged to appeal the editor's rejection through the official channels.
7 July 2021 - I send the letter of appeal.
10 August 2021 - I receive the decision on the appeal: they conclude that, "in its current form, your note falls below the level of contribution for a stand-alone publication at the EJ." They then suggest two options:
To write a very short corrigendum note showing how applying the recommended procedure affects DL's results. "Importantly, if you decide to submit this corrigendum, please submit it with the full replication package, allowing to replicate your results. The editorial board intends to invite the authors of the original paper to review your note together with the replication package. If your findings stand, while DL’s do not, the EJ will publish the note."
To write a full article replicating all papers that do not follow the recommended procedure and possibly to submit it to The Economics of Education Review.
1 September 2021 - choosing option 1, I resubmit the shortened version of the comment to EJ. (In further correspondence, I am informed that, after all, "If accepted your article would be published as a comment, rather than as a corrigendum.")
16 December 2021 - the editor rejects the paper (with R1 suggesting an R&R and R2, the original authors, "raising first order issues").
One reason given is that Jerrim et al. (2017) had already discussed how Lavy (2015) did not follow the correct procedure, so that my remark about Danzer and Lavy (2018) does not constitute an original contribution.
In their reply, the original authors acknowledge the mistake in their original paper, but argue that when they slice their data in a different way, grouping mothers into three groups by education (instead of two), they find significant negative results for the group of children whose mothers have very low education (around 10% of all children). The impact on the remaining 90% of children is not significant. The analysis includes no adjustment for multiple testing and no justification for why the data should be sliced in this particular way.
Based on this new finding, the editor argues that the original claims of the paper hold, and that the comment is not warranted. In the editor's letter, there is also an explicit request to keep the matter private, as "R2 is demanding not to make his/her report public given that this is the subject of ongoing research and we would be grateful if you could comply with this request."
March 2023 - the paper is accepted for publication in the Journal of Applied Econometrics.
September 2023 - I receive an unexpected email from EJ requesting that I share the documents from the final revision round, as they believe that showing only the first-round reports on my webpage gives an inaccurate description of the process. This is why I am sharing them now.