Diligencia is a specialist information services provider focused on company data within emerging markets, covering over three million company records across 67 countries in the Middle East and Africa.
Their mission is to provide regions poorly served by reliable, accessible public information with accurate, actionable, fact-based insights into new markets, counterparties, or suppliers.
Diligencia wanted to integrate a new source of data - legal records in Arabic - into their proprietary reference dataset containing organisation names, people names, and the nature of the relationships between them.
This presented a classic data wrangling challenge: they wanted to merge datasets, add new rows for new entities, and update existing entities with new information without introducing duplicates.
How Diligencia used Hivemind
Diligencia started with a list of entities from the legal records, but did not know whether each one was a person or an organisation, or how it should be linked into their existing dataset. They approached the problem by integrating ClarifiedBy, their proprietary search engine, with Hivemind in a man-and-machine chain.
First, each entity was passed through ClarifiedBy to produce two lists of likely candidate matches: one for people and one for organisations. These results were fed through the Hivemind API. A microtask was created for each entity and these were presented to an in-house team of contributors in the Hivemind interface. The contributors were asked to judge whether there was a correct match for the entity they were presented with, and if so, what that match was. The validated results from the Hivemind Task were then hooked into a process to update Diligencia’s database, as required.
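The chain described above can be sketched roughly as follows. This is a minimal illustration, not Diligencia's implementation: the `Record` type, the function names, and the shape of the microtask and judgement dictionaries are all hypothetical, and a simple string-similarity ranking from Python's standard library stands in for the ClarifiedBy search engine. No real Hivemind API calls are shown.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Record:
    """Hypothetical row in the reference dataset."""
    record_id: str
    name: str
    kind: str  # "person" or "organisation"
    sources: list = field(default_factory=list)


def candidate_matches(entity_name, records, kind, limit=3, threshold=0.5):
    """Stand-in for the search step: rank records of one kind by name similarity."""
    scored = [
        (SequenceMatcher(None, entity_name.lower(), r.name.lower()).ratio(), r)
        for r in records
        if r.kind == kind
    ]
    scored = [(score, r) for score, r in scored if score >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:limit]]


def build_microtask(entity_name, records):
    """One microtask per entity: two candidate lists for a contributor to judge."""
    return {
        "entity": entity_name,
        "people": [r.record_id for r in candidate_matches(entity_name, records, "person")],
        "organisations": [
            r.record_id for r in candidate_matches(entity_name, records, "organisation")
        ],
    }


def apply_judgement(entity_name, judgement, records, source):
    """Update the matched record, or add a new one if the contributor found no match.

    `judgement` is assumed to look like {"match_id": "o1" or None, "kind": "organisation"}.
    """
    by_id = {r.record_id: r for r in records}
    match_id = judgement.get("match_id")
    if match_id in by_id:
        record = by_id[match_id]
        if source not in record.sources:  # avoid recording the same source twice
            record.sources.append(source)
        return record
    new_record = Record(f"new-{len(records) + 1}", entity_name, judgement["kind"], [source])
    records.append(new_record)
    return new_record
```

The design point is the split of responsibilities: the automated step only narrows the field to a short list, while the decision about whether a candidate is the right match, and therefore whether the database gets an update or a new row, is left to the human judgement.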
This use case demonstrates the complementary capabilities of man and machine. The search engine provided a great way of filtering down a huge list of potential matches for each entity, something that would have been intractable for the human contributors to do manually. The contributors were then able to apply their contextual knowledge and natural intuition to the problem of finding a precise match from a manageable list. This is very similar to the method of information discovery and validation that we all use every day when searching for things online: Google presents us with a list of suggested options, and we then home in on the best result for us.
For Diligencia, this approach meant they were able to integrate their new data source into their reference dataset accurately, without duplication, and in less than 20 seconds per entity.