Overview
The Why
The How
The Security

The tl;dr

You've been giving us a lot of feedback about data generation times and features, and we've been listening. After a lot of R&D, we've now rolled out a new system for data generation in Gorilla, codename: Data Hamster! It lets you combine your data across experiment versions and across tasks/questionnaires within Gorilla.

In short, you can get the number of data files that you need to download down to just two!

The Why

Data is the key end result of the experiments you run on Gorilla. Just as Gorilla makes building your experiment faster and easier, generating and downloading your data should be fast and simple, giving you your data in the format you need.

Feedback from you, our users, and from the Gorilla team itself, was that we weren't hitting this requirement with data generation. As the number of active researchers, the complexity of experiments, and the number of participants being recruited went up, so did data generation times. Further, having data split between experiment tree nodes (tasks/questionnaires) and across experiment versions was proving arduous, requiring a lot of manual data merging. These, and other invaluable insights, were provided by the users who filled in our Data Collection survey early in 2023.

We knew we needed a better interface for selecting the data you want to download. To make that impactful, we first needed to make data generation faster overall. With our existing system, more filtering options would only mean longer wait times. Worse, we were starting to hit bandwidth limits within Microsoft Azure that would be expensive to overcome.

We needed a system that would:

make existing data generation faster
allow new data generation options with little-to-no impact on data generation times
scale reliably as our user base grows into the hundreds of thousands
come at no extra cost to you (science is expensive enough right now!)

I won't go into the details of all the options we considered and their various pros and cons. That's a much longer and more technical article. What I will do is introduce you to the solution we've implemented, codename: Data Hamster.

The How

Built for ingesting, the Hamster can process a lot of content in a small space, in a higher-bandwidth environment. This allows the Hamster to process many more requests concurrently than our existing data generation system. Further, using the latest in stream feeding technology, we can run more complicated processing steps on individual lines and sections of data with only a small reduction in speed.

In the short term, what this allows us to do is offer three new options for data generation:

Combine data across experiment versions. This will pool together all of the complete/included participants across all versions of the experiment into a single set of files. Using just this option, you will get one file per task/questionnaire, with each file containing all participants that have participated in that task/questionnaire across all experiment versions.
Combine all task data into one file. This will combine all of the tasks (for the selected experiment version) into a single file. Using just this option, you will get one file containing all of the data from all of the task nodes (for all the participants on the selected experiment version) and then one file per questionnaire.
Combine all questionnaire data into one file. This will combine all of the questionnaires (for the selected experiment version) into a single file. Using just this option, you will get one file containing all of the data from all of the questionnaire nodes (for all the participants on the selected experiment version) and then one file per task.
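
To make the first option concrete, here is a minimal Python sketch of what pooling one task's data across experiment versions amounts to. It is an illustration only, not Gorilla's implementation: the column names (including `experiment_version`), the sample data, and the `stream_rows` and `combine_versions` helpers are all assumptions made for the example. Rows are handled one at a time, in the spirit of the line-by-line processing described above.

```python
import csv
import io
from typing import Dict, Iterable, Iterator


def stream_rows(version: int, lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield rows from one version's CSV, tagged with the version number."""
    for row in csv.DictReader(lines):
        row["experiment_version"] = str(version)  # hypothetical column name
        yield row


def combine_versions(files_by_version: Dict[int, Iterable[str]]) -> str:
    """Pool rows from every version into a single CSV, one row at a time.

    Assumes every version's file has the same columns.
    """
    out = io.StringIO()
    writer = None
    for version in sorted(files_by_version):
        for row in stream_rows(version, files_by_version[version]):
            if writer is None:
                # Take the header from the first row seen.
                writer = csv.DictWriter(
                    out, fieldnames=list(row), lineterminator="\n"
                )
                writer.writeheader()
            writer.writerow(row)
    return out.getvalue()


# Made-up data for one task collected on two experiment versions.
v1 = ["participant_id,reaction_time", "p1,512", "p2,634"]
v2 = ["participant_id,reaction_time", "p3,488"]
combined = combine_versions({1: v1, 2: v2})
print(combined)
```

The result is a single file for the task, with a column recording which version each participant's rows came from.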

You can use any of these options in combination with each other. Selecting all three would mean you'll get:

One file containing all the task data from all participants across all versions of the experiment, and
One file containing all the questionnaire data from all participants across all versions of the experiment.

That's just two files for your whole experiment!
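
The arithmetic behind that claim can be sketched in a few lines of Python. This is a rough illustration, not Gorilla's code: the `count_files` function and its parameters are hypothetical, and it assumes that without any combining you get one file per task/questionnaire per experiment version, with every node present in every version.

```python
def count_files(
    n_tasks: int,
    n_questionnaires: int,
    n_versions: int,
    combine_versions: bool = False,
    combine_tasks: bool = False,
    combine_questionnaires: bool = False,
) -> int:
    """Estimate how many data files a download would contain."""
    # Pooling across versions collapses the per-version sets into one set.
    version_sets = 1 if combine_versions else n_versions
    # Combining tasks (or questionnaires) collapses those nodes into one file.
    task_files = 1 if (combine_tasks and n_tasks > 0) else n_tasks
    questionnaire_files = (
        1 if (combine_questionnaires and n_questionnaires > 0) else n_questionnaires
    )
    return version_sets * (task_files + questionnaire_files)


# A made-up experiment: 4 tasks, 3 questionnaires, 2 versions.
print(count_files(4, 3, 2))                         # 14
print(count_files(4, 3, 2, combine_versions=True))  # 7
print(count_files(4, 3, 2, True, True, True))       # 2
```

Under these assumptions, the example experiment drops from 14 files to 2 once all three options are selected.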

Data Migration

To make Data Hamster as performant as possible with these new combining options, your data needs to be in a readily nom-able, easily processed format. Data from new participants collected on new and existing experiments will be automatically compiled into this format.

However, for existing experiments/participants, there needs to be an additional migration step, copying and converting all of your existing participant data into the new format. This migration step will happen in the background: while it's happening, there will be no disruption to your experiment or data collection. You will still be able to regenerate and download data files using the old system. Importantly, the migration involves copying the data files, so there is no risk of data loss.

The Security

All content remains within Microsoft Azure, covered by all of Microsoft's and our existing data security requirements and regulations. The copying of data between existing storage and Data Hamster's storage tooling is done entirely within Azure; data is not manually accessed or interacted with by Gorilla staff. All existing regulatory agreements and MSAs are preserved.

With everything we create at Gorilla, security is our first and foremost consideration. We are constantly updating and revising our existing content and new tooling to use the latest security features and recommendations.