This is less of an OpenAPS/DIYPS/diabetes-related post, although that is normally what I blog about. However, since we created the #OpenAPS Data Commons on Open Humans, to allow those of us who desire to donate our diabetes data to research, I have been spending a lot of time figuring out the process from uploading your data to how data is managed and shared securely with researchers. The hardest part is helping researchers figure out how to handle the data – because we PWDs produce a lot of data . So this post explains some of the challenges of the data management to get it to a researcher-friendly format. I have been greatly helped over the years by general purpose open-source work from other people, and one of the things that helps ME the most as a non-traditional programmer is plain language posts explaining the thought process by behind the tools and the attempted solution paths. Especially because sometimes the web pages and blog posts pop higher in search than nitty gritty tool documentation without context. (Plus, I’ve been taking my own advice about not letting myself hold me back from trying, even when I don’t know how to do things yet.) So that’s what this post is!
QOTD: "55MB gzip-compressed […] you certainly stress-tested this"
— Dana #hcsm #OpenAPS (@danamlewis) January 26, 2017
Background/inspiration for the project and the tools I had to build:
We’re using Nightscout, which is a remote data-viewing platform for diabetes data, made with love and open source and freely available for anyone with diabetes to use. It’s one of the best ways to display not only continuous glucose monitor (CGM) data, but also data from our DIY closed loop artificial pancreases (#OpenAPS). It can store data from a number of different kinds and brands of diabetes devices (pumps, CGMs, manual data entries, etc.), which means it’s a rich source of data. As the number of DIY OpenAPS users are growing, we estimate that our real-world use is overtaking the amount of total hours of data from clinical trials of closed loop artificial pancreas systems. In the #WeAreNotWaiting spirit of moving quickly (rather than waiting years for research teams to collect and analyze their own data) we want to see what we can learn from OpenAPS usage, not only by donating data to help traditional researchers speed up their work, but also by co-designing research studies of the things of most value to the diabetes community.
Step 1: Data from users to Open Humans
I thought Step 1 would be the hardest. However, thanks to Madeleine Ball, John Costik, and others in the Nightscout community, a simple Nightscout Data Transfer App was created that enables people with Nightscout data to pop it into their Open Humans accounts. It’s then very easy to join different projects (like the OpenAPS Data Commons) and share your data with those projects. And as the volunteer administrator of the OpenAPS Data Commons, it’s also easy for me to provide data to researchers.
The biggest challenge at this stage was figuring out how much data to pull from the API. I have almost 3 years worth of DIY diabetes data, and I have numerous devices over time uploading all at once…which makes for large chunks of data. Not everyone has this much data (or 6-7 rigs uploading constantly ;)). Props to Madeleine for the patience in working with me to make sure the super users with large data sets will be able to use all of these tools!
Step 2: Sharing the data with researchers
This was easy. Yay for data-sharing tools like Dropbox.
Step 3: Researchers being able to use the data
Here’s where thing started to get interesting. We have large data files that come in json format from Nightscout. I know some researchers we will be working with are probably very comfortable working with tools that can take large, complex json files. However…not all will be, especially because we also want to encourage independent researchers to engage with the data for projects. So I had the belated realization that we need to do something other than hand over json files. We need to convert, at the least, to csv so it can be easily viewed in Excel.
Sounds easy, right?
According to basic searches, there’s roughly a gazillion ways to convert json to csv. There’s even websites that will do it for you, without making you run it on the command line. However, most of them require you to know the types of data and the number of types, in order to therefore construct headers in the csv file to make it readable and useful to a human.
This is where the DIY and infinite possibility nature of all the kinds of diabetes tools anyone could be using with Nightscout, plus the infinite ways they can self-describe profiles and alarms and methods of entering data, makes it tricky. Just based on an eyeball search between two individuals, I was unable to find and count the hundred+ types of data entry possibilities. This is definitely a job for the computer, but I had to figure out how to train the computer to deal with this.
Again, json to csv tools are so common I figured there HAD to be someone who had done this. Finally, after a dozen varying searches and trying a variety of command line tools, I finally found one web-based tool that would take json, create the schema without knowing the data types in advance, and convert it to csv. It was (is) super slick. I got very excited when I saw it linked to a Github repository, because that meant it was probably open source and I can use it. I didn’t see any instructions for how to use it on the command line, though, so I message the author on Twitter and found out that it didn’t yet exist and was a not-yet-done TODO for him.
Sigh. Given this whole #WeAreNotWaiting thing (and given I’ve promised to help some of the researchers in figuring this out so we can initiate some of the research projects), I needed to figure out how to convert this tool into a command line version.
So, I did.
- I taught myself how to unzip json files (ended up picking `gzip -cd`, because it works on both Mac and Linux)
- I planned to then convert the web tool to be able to work on the command line, and use it to translate the json files to csv.
But..remember the big file issue? It struck again. So I first had to figure out the best way to estimate the size and splice or split the json into a series of files, without splitting it in a weird place and messing up the data. That became jsonsplit.sh, a tool to split a json file based on the size you give it (and if you don’t specify, it defaults to something like 100000 records).
FWIW: 100,000 records was too much for the more complex schema of the data I was working with, so I often did it in smaller chunks, but you can set it to whatever size you prefer.
So now “all” I had to do was:
- Unzip the json
- Break it down if it was too large, using jsonsplit.sh
- Convert each of these files from json to csv
Because I knew how hard this all was, and wanted other people to be able to easily use this tool if they had large, complex json with unknown schema to deal with, I created a package.json so I could publish it to npm so you can download and run it anywhere.
I also had to create a script that would pass it all of the Open Humans data; unzip the file; run jsonsplit.sh, run complex-json2csv.js, and organize the data in a useful way, given the existing file structure of the data. Therefore I also created an “OpenHumansDataTools” repository on Github, so that other researchers who will be using Nightscout-based Open Humans data can use this if they want to work with the data. (And, there may be something useful to others using Open Humans even if they’re not using Nightscout data as their data source – again, see “large, complex, challenging json since you don’t know the data type and count of data types” issue. So this repo can link them to complex-json2csv.js and jsonsplit.sh for discovery purposes, as they’re general purpose tools.) That script is here.
My next TODO will be to write a script to take only slices of data based on information shared as part of the surveys that go with the Nightscout data; i.e. if you started your DIY closed loop on X data, take data from 2 weeks prior and 6 weeks after, etc.
I also created a pull request (PR) back to the original tool that inspired my work, in case he wants to add it to his repository for others who also want to run his great stuff from the command line. I know my stuff isn’t perfect, but it works and I’m proud of being able to contribute to general-purpose open source in addition to diabetes-specific open source work. (Big thanks as always to everyone who devotes their work to open source for others to use!)
So now, I can pass researchers json or csv files for use in their research. We have a number of studies who are planning to request access to the OpenAPS Data Commons, and I’m excited about how work like this to make diabetes data more broadly available for research will help improve our lives in the short and long term!