Our software is built on a pragmatic approach to solving complicated data collection, labelling, wrangling and elicitation problems.

Using the collective intelligence of man and machine, Hivemind helps businesses with their large-scale data collection, data labelling and data wrangling challenges.

This approach breaks down complicated data problems into a ‘chain’ of conceptually simple Tasks. Each Task is then channelled to the best resource to solve it, whether computational or human. For example, Optical Character Recognition (OCR) and Named Entity Recognition (NER) would be solved computationally, and more complicated annotations or Tasks requiring specific expertise would be solved by a human contributor. 

Human contributors can be anonymous workers on a crowdsourced platform, workers from a managed outsourcing firm, more skilled internal workers from your own business, or any combination of these. This way, the right team solves the right problem: simpler Tasks are performed by lower-cost workforces, and expert input is used only where it’s really needed.

Each Task in the chain is made up of many microtasks (which we call Instances). The answers to all of the microtasks across the chain are then combined to build a new, clean, structured dataset.
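
To make the idea of a chain concrete, here is a minimal Python sketch. It is an illustration only, not Hivemind's actual implementation: the names `Task`, `run_chain`, `fake_ocr` and `fake_human` are hypothetical, the OCR step is a trivial stand-in, and the human step is simulated with a lookup table of pre-supplied answers.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    """One link in the chain (hypothetical structure, for illustration)."""
    name: str
    resource: str                  # "computational" or "human"
    solve: Callable[[dict], dict]  # answers a single Instance (microtask)


def run_chain(tasks: List[Task], records: List[dict]) -> Tuple[List[dict], Dict[str, int]]:
    """Run every record through every Task in order.

    Each (Task, record) pair is one Instance. The answers are merged into
    the record, so the final rows form the structured dataset.
    """
    rows = [dict(r) for r in records]  # copy so the inputs are not mutated
    workload: Counter = Counter()      # Instances handled per resource type
    for task in tasks:
        for row in rows:
            row.update(task.solve(row))
            workload[task.resource] += 1
    return rows, dict(workload)


def fake_ocr(row: dict) -> dict:
    # Stand-in for a real OCR engine: just tidies the raw "scan" text.
    return {"text": row["scan"].strip()}


# Simulated human answers, keyed by the transcribed text.
HUMAN_ANSWERS = {
    "Invoice from Acme Ltd": "invoice",
    "Dear Sir, I resign": "letter",
}


def fake_human(row: dict) -> dict:
    # Stand-in for sending the Instance to a human contributor.
    return {"doc_type": HUMAN_ANSWERS[row["text"]]}


CHAIN = [
    Task("transcribe", "computational", fake_ocr),
    Task("classify", "human", fake_human),
]

dataset, workload = run_chain(
    CHAIN,
    [{"scan": "  Invoice from Acme Ltd "}, {"scan": "Dear Sir, I resign\n"}],
)
```

In this sketch each row of `dataset` carries the answers from every Task it passed through, and `workload` records how many Instances went to each kind of resource.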

When using this approach, it is vital to consider in detail how well each microtask is completed. For microtasks completed computationally, this means using cutting-edge methods and assessing whether their output requires downstream human validation. Where a microtask is answered by a human, the fact that humans get tired and make mistakes needs to be taken into account. To help with this, our software offers a variety of in-built data quality techniques, ranging from automated normalisation and equivalency evaluation to the aggregation of independent human judgement using consensus-based or sampling-based approaches.
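
As a rough illustration of a consensus-based check, the Python sketch below normalises several independent answers to the same microtask and accepts the most common one only if enough contributors agree. It is a simplified example, not a description of the product's internals: the function names and the `min_agreement` threshold of 0.6 are assumptions chosen for the example.

```python
from collections import Counter
from typing import List, Optional


def normalise(answer: str) -> str:
    """Lower-case the answer and collapse whitespace so that trivially
    different spellings of the same answer compare as equal."""
    return " ".join(answer.strip().lower().split())


def consensus(answers: List[str], min_agreement: float = 0.6) -> Optional[str]:
    """Return the most common normalised answer if its share of the votes
    reaches min_agreement; otherwise return None (no consensus)."""
    if not answers:
        return None
    counts = Counter(normalise(a) for a in answers)
    top_answer, votes = counts.most_common(1)[0]
    if votes / len(answers) >= min_agreement:
        return top_answer
    return None


# Three of four contributors agree once formatting differences are removed.
agreed = consensus(["Acme Ltd", " acme  ltd ", "ACME LTD", "Acme Limited"])

# An even split falls below the threshold, so no answer is accepted.
disputed = consensus(["yes", "no"])
```

A microtask that returns `None` here could, for example, be sent to additional contributors or escalated to a more expert reviewer.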

Using this approach, businesses can:

  • Create proprietary, useful, structured datasets from any source, no matter how messy or unstructured it might be.

  • Produce accurate, unbiased training data for the development of machine learning algorithms.

  • Enrich, clean, map and monitor their datasets in detail and at scale, without burdening their teams with manual work.

  • Make informed decisions using aggregated opinion or sentiment, or combine the expertise of their staff or external expert network to elicit probabilistic predictions of the variables they care about.