Improving Harmony’s PDF extraction with user testing

Since we built Harmony, a common complaint has been that it frequently identifies the wrong questions in PDFs. The original algorithm for finding questions in PDFs was a mixture of rule-based heuristics and hand-coded logic, for example looking for lines in the document that begin with numbers. This approach was fragile: it worked fine on short questionnaires such as the GAD-7, but failed on larger documents.
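To illustrate why a "lines that begin with numbers" rule is fragile, here is a minimal sketch in Python. It is not Harmony's original code: the regular expression, the function name `find_numbered_lines` and the sample text are all assumptions made up for this example.

```python
import re
from typing import List

# Hypothetical rule: a line that starts with a number, then "." or ")", then text.
NUMBERED_LINE = re.compile(r"^\s*(\d+)[.)]\s+(.+)$")


def find_numbered_lines(text: str) -> List[str]:
    """Return the text of every line that looks like a numbered item."""
    questions = []
    for line in text.splitlines():
        match = NUMBERED_LINE.match(line)
        if match:
            questions.append(match.group(2).strip())
    return questions


# Made-up sample text, loosely styled on a short questionnaire.
sample = """Over the last 2 weeks, how often have you been bothered by the following problems?
1. Feeling nervous, anxious or on edge
2. Not being able to stop or control worrying
3) Worrying too much about different things
Feeling afraid as if something awful might happen
12. Scoring: add the values for each item"""

extracted = find_numbered_lines(sample)
for item in extracted:
    print(item)
```

On this sample the rule picks up the three numbered questions, misses the question that has no number, and wrongly includes the numbered scoring note. Errors of both kinds become more common as documents get longer and less regular.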

We decided to run a competition with our partner DOXA AI where members of the public could train their own model to extract questions from PDFs.

We provided a few hundred examples of manually annotated training data and held back some test data for evaluating the submitted models.

The competition was won by Aashvin Chandru Relwani, with Siddhant in second place.

However, we then needed to user-test the new PDF model in the Harmony app, so that we could compare it with the old model and confirm that it is better.

We deployed two versions of Harmony side by side: the original Harmony, and the new PDF extraction model on a new URL.

A user at Ulster University evaluated the new model against the old model on a number of new questionnaires, recording which questions each model extracted correctly.

We then ran a statistical analysis comparing the two models. The analysis is here: https://github.com/harmonydata/test_new_pdf_model

You can view the results of our analysis as an R markdown notebook here: https://rpubs.com/fastdatascience/harmony-pdf

The old model had an accuracy of 41% and the new model had an accuracy of 95%. A matched-pairs t-test on the difference gave p = 0.00039.
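For readers unfamiliar with the matched-pairs t-test, here is a minimal sketch in Python of how the test statistic is computed. It is not our actual analysis, which is in the R notebook linked above. The counts below are made-up illustrative numbers, not our results, and the function name `paired_t_statistic` is an assumption for this example. The idea is that each questionnaire is scored by both models, so the test is run on the per-questionnaire differences in accuracy.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t statistic, degrees of freedom) for a matched-pairs t-test."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # The test works on the difference within each matched pair.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Made-up counts for illustration only:
# (questions in questionnaire, correct with old model, correct with new model)
counts = [(10, 4, 10), (20, 9, 19), (8, 2, 8), (15, 7, 14), (12, 6, 11)]

old_acc = [old / total for total, old, _ in counts]
new_acc = [new / total for total, _, new in counts]

t_stat, dof = paired_t_statistic(old_acc, new_acc)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

The p-value is then read from the t distribution with n - 1 degrees of freedom. In practice a library call does both steps, for example `t.test(..., paired = TRUE)` in R or `scipy.stats.ttest_rel` in SciPy.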

Next actions

Coming soon: we will update the Harmony web app to use the new PDF model.
