Data Hamster

The tl;dr

You've been giving us a lot of feedback about data generation times and features! In May 2024, following a year of R&D, we'll be rolling out a new system for data generation in Gorilla, codename: Data Hamster! For existing experiments, this will require some data migration which, depending on how many participants you have, could take "a little bit of time" (roughly tens of minutes). Your data will still be safe within the Azure infrastructure and covered by our existing, extensive data security policies. Once the migration is complete, you'll be able to combine your data across experiment versions and across tasks/questionnaires within Gorilla.

In short, you can get the number of data files that you need to download down to just two!

To get Data Hamster added to your account, you just need to get in touch via our Support contact form, choose 'My Account' and ask for the new data feature/Data Hamster.

This is the first of two major updates to data coming this year - we'll be talking more about the second half of this, Data Studio, soon!


The Why

Data is the key end result of the experiments you run on Gorilla. Just as Gorilla makes building your experiment faster and easier, generating and downloading your data should be fast and simple, giving you your data in the format you need.

Feedback from you, our users, and from the Gorilla team itself, is that we aren't hitting this requirement with data generation. As the number of active researchers, the complexity of experiments, and the number of participants being recruited have gone up, so too have data generation times. Further, having data split between experiment tree nodes (tasks/questionnaires) and across experiment versions was proving arduous, requiring a lot of manual data merging. These, and other invaluable insights, were provided by the users who filled in our Data Collection survey early in 2023.

We knew we needed a better interface for selecting the data you want to download. This would need to include filtering by experiment versions, specific participants, and individual tasks/questionnaires. To cover a reasonable range of your requests, it would also need to allow selection of specific pages/screens/displays and, finally, filtering out whole columns of data.

To make that impactful, we first needed to make data generation faster overall. With our existing system, more filtering options would only mean longer wait times. Worse, we were starting to hit bandwidth limits within Microsoft Azure that would be expensive to overcome.

We needed a system that would

  • run existing data generation faster
  • allow new data generation options with little-to-no impact on data generation times
  • provide reliable scaling as our user base grows into hundreds of thousands
  • have no increase in cost to you (science is expensive enough right now!)

I won't go into the details of all the options we considered and their various pros and cons. That's a much longer and more technical article. What I will do is introduce you to the solution we've implemented, codename: Data Hamster.


The How

Built for ingesting, the Hamster can process a lot of content in a small space, in a higher bandwidth environment. This allows the Hamster to process many more requests concurrently than our existing data generation system. Further, using the latest in stream feeding technology, we can run more complicated processing steps on individual lines and sections of data with only small reductions in speed.
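To give a feel for what processing data as a stream means, here is a minimal Python sketch. It is not Gorilla's implementation: the function names and the sample columns (`participant`, `task`, `rt`) are made up for illustration. The point is only that each row is read, transformed, and passed on one at a time, so extra processing steps add little overhead and the whole file never has to sit in memory.

```python
import csv
import io


def stream_rows(lines):
    """Yield parsed rows one at a time instead of loading the whole file."""
    yield from csv.DictReader(lines)


def keep_columns(rows, columns):
    """Per-row processing step: keep only the requested columns."""
    for row in rows:
        yield {column: row[column] for column in columns}


# Hypothetical sample data standing in for a much larger data file.
sample = io.StringIO("participant,task,rt\n1,stroop,512\n2,stroop,430\n")

# Steps are chained lazily; nothing is read until the rows are consumed.
pipeline = keep_columns(stream_rows(sample), ["participant", "rt"])
result = list(pipeline)
```

Further steps (filtering participants, renaming columns) could be chained in the same way without changing how the data is read.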

In the short term, what this allows us to do is offer three new options for data generation:

  1. Combine data across experiment versions. This will pool together all of the complete/included participants across all versions of the experiment into a single set of files. Using just this option, you will get one file per task/questionnaire, with each file containing all participants that have participated in that task/questionnaire across all experiment versions.
  2. Combine all task data into one file. This will combine all of the tasks (for the selected experiment version) into a single file. Using just this option, you will get one file containing all of the data from all of the task nodes (for all the participants on the selected experiment version) and then one file per questionnaire.
  3. Combine all questionnaire data into one file. This will combine all of the questionnaires (for the selected experiment version) into a single file. Using just this option, you will get one file containing all of the data from all of the questionnaire nodes (for all the participants on the selected experiment version) and then one file per task.

You can use any of these options in combination with each other. Selecting all three means you'll get

  • One file containing all the task data from all participants across all versions of the experiment, and,
  • One file containing all the questionnaire data from all participants across all versions of the experiment.

That's just two files for your whole experiment!
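If you'd like to check how many files a given combination of options produces, the rules above can be written out as a small Python sketch. This is only an illustration of the descriptions in this article: the function, its parameters, and the file labels are hypothetical and are not part of Gorilla.

```python
def output_files(tasks, questionnaires, selected_version,
                 combine_versions=False, combine_tasks=False,
                 combine_questionnaires=False):
    """Return illustrative labels for the files a download would contain."""
    # Combining versions pools participants from every version into one set
    # of files; otherwise the download covers only the selected version.
    scope = "all-versions" if combine_versions else selected_version

    files = []
    if combine_tasks and tasks:
        files.append(f"{scope}/all-tasks")
    else:
        files += [f"{scope}/task-{name}" for name in tasks]

    if combine_questionnaires and questionnaires:
        files.append(f"{scope}/all-questionnaires")
    else:
        files += [f"{scope}/questionnaire-{name}" for name in questionnaires]
    return files


# Hypothetical experiment with two tasks and two questionnaires.
tasks = ["stroop", "nback"]
questionnaires = ["consent", "demographics"]

no_options = output_files(tasks, questionnaires, "v2")
tasks_only = output_files(tasks, questionnaires, "v2", combine_tasks=True)
all_three = output_files(tasks, questionnaires, "v2", combine_versions=True,
                         combine_tasks=True, combine_questionnaires=True)
```

With this example experiment, no options gives four files for the selected version, combining tasks alone gives three, and all three options together give two.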

Data Migration

To make Data Hamster as performant as possible with these new combining options, your data needs to be in a readily nom-able, easily processed format. Once Data Hamster is switched on for all users, new participants collected on new and existing experiments will be automatically compiled into this format.

However, for existing experiments/participants, there needs to be an additional migration step, copying and converting all of your existing participant data into the new format. How long this takes will depend on how many participants you're migrating and how much data each participant has created. Typically this will be on the order of seconds to minutes, but for very large experiments it could be longer.

While migrating this data, there will be no disruption to your experiment or data collection. You will still be able to regenerate and download data files using the old system. Importantly, the migration involves copying the data files, so there is no risk of data loss.


The Security

All content remains within Microsoft Azure, covered by all of their and our existing data security requirements and regulations. The copying of data between existing storage and Data Hamster's storage tooling is done entirely within Azure - data is not manually accessed or interacted with by Gorilla staff. All existing regulatory agreements and MSAs are preserved.

With everything we create at Gorilla, security is our first and foremost consideration. We are constantly updating and revising our existing content and new tooling to use the latest security features and recommendations.


The When

On 23rd May 2024, we begin an early access period for a small number of users while we test the workflow and stability of the new data generation process. During this early access period, users will first be prompted to sync their data to the new system, which will run the manual copying step mentioned earlier. Then, the new data generation options will become available.

From the beginning of June, Data Hamster will be running data generation in parallel with our existing data generation service. This will allow us to further refine and load test the new system, without disrupting the existing data generation process. This will include passively loading existing participant data into Data Hamster as necessary. Further, the data generation options will be made available by request through the Support Contact Form.

Once we are satisfied that Data Hamster is reliable and performant enough, we'll switch data generation over to Data Hamster entirely. This is heavily dependent on how the previous step goes, but could be as early as July if all goes well.


And then what...

Based on the results of our Data survey from 2023, we're confident that this update alone will resolve many of the pain points that users have experienced with data generation. However, with problems solved come problems new. Combining data from all experiment versions is great... until you need to exclude the participants from your pilot version. Combining data from all tasks into a single file is great... until you combine too many tasks together and the number of columns becomes unwieldy.

This is why, in parallel, we've been developing Data Studio - a new GUI tool for managing your data files, their content, and creating custom reports. This will allow you to select which participants, tasks/questionnaires, displays/screens/pages and columns you want to see in your data. Further, you'll be able to bundle these into report templates you can then reuse in other experiments, add to your Open Materials pages, and share with your colleagues and institutions.
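In the meantime, the pilot-version problem can be handled by hand on a combined download. Below is a minimal Python sketch, assuming the combined file has a column identifying the experiment version. The column names and sample rows are placeholders, so check them against the headers in your own file.

```python
import csv
import io


def exclude_versions(rows, excluded, version_column="Experiment Version"):
    """Yield only the rows whose version is not in the excluded set."""
    for row in rows:
        if row[version_column] not in excluded:
            yield row


# Hypothetical combined file: version 1 was the pilot.
combined = io.StringIO(
    "Participant ID,Experiment Version,Score\n"
    "p1,1,10\n"
    "p2,2,12\n"
    "p3,2,9\n"
)

# csv.DictReader reads every value as a string, so versions are compared
# as strings ("1"), not numbers.
kept = list(exclude_versions(csv.DictReader(combined), excluded={"1"}))
kept_ids = [row["Participant ID"] for row in kept]
```

To use this on a real download, replace the `io.StringIO` sample with `open("your_file.csv", newline="")`.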

We'll be talking more about Data Studio over the summer and will be rolling it out over the second half of this year.

