A few years ago I made a big deal about a tool I had created by converting someone’s web tool into a command line tool that takes complex JSON data and converts it to CSV. Years later, I (and many others: it’s been downloaded 1600+ times!) am still using this tool, because I haven’t found anything better for data where you don’t know the data structure or the data structure varies across files.
I ended up creating a repository on GitHub to store it, along with details on running it, and have expanded it over the last (almost) six years as I and others have added additional tools. For example, it’s where Arsalan, one of my frequent collaborators, and I store open source code from some of our recent papers.
Recently, I added two more small scripts. The motivation was to help researchers who have been successfully using the OpenAPS Data Commons and want to update their dataset with a later version of the data. Chances are, they have cleaned and worked with a previous version of the dataset. Instead of having to re-clean all of the data, this set of scripts should help narrow down what the “new” data is that needs to be pulled out, cleaned, and appended to a previously cleaned dataset.
You can check out the full tool repository here (it has several other scripts in addition to the ones mentioned above). The latest additions are two Python scripts. The first checks the contents of an existing folder and lists the memberIDs and file names it finds. This is useful to run on an existing, already-cleaned dataset to see what you currently have. It can also be run on the newer/bigger dataset that is available. Then, the second script can be run to compare the memberIDs and file names in the newer/bigger dataset against those in the previously cleaned/smaller/older dataset. Those that “match” already exist in the version of the dataset you have; they don’t need to be pulled again. The others don’t exist in your current dataset, and can be popped into a script to pull out just those data files, which can then be cleaned and appended to the existing dataset.
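To make the two steps concrete, here is a minimal sketch of the list-then-compare idea in Python. It is not the code from the repository: it assumes a hypothetical folder layout of one subfolder per memberID with that member's data files inside, and the function names (list_dataset, write_listing, read_listing, compare_listings) are made up for illustration.

```python
import csv
from pathlib import Path


def list_dataset(root):
    """Return a set of (memberID, filename) pairs found under root.

    Assumption (hypothetical layout): root/<memberID>/<data files...>
    """
    pairs = set()
    for member_dir in Path(root).iterdir():
        if not member_dir.is_dir():
            continue  # skip stray files sitting next to the member folders
        for path in member_dir.rglob("*"):
            if path.is_file():
                pairs.add((member_dir.name, path.name))
    return pairs


def write_listing(pairs, out_csv):
    """Save a listing as a CSV with memberID and filename columns."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["memberID", "filename"])
        writer.writerows(sorted(pairs))


def read_listing(in_csv):
    """Load a listing that was saved by write_listing."""
    with open(in_csv, newline="") as f:
        return {(row["memberID"], row["filename"]) for row in csv.DictReader(f)}


def compare_listings(old_pairs, new_pairs):
    """Split the newer listing into files you already have and files you don't."""
    matched = sorted(new_pairs & old_pairs)  # already in the cleaned dataset
    missing = sorted(new_pairs - old_pairs)  # still need to be pulled and cleaned
    return matched, missing
```

You would run list_dataset once on the already-cleaned folder and once on the newer folder, then pass both results to compare_listings; the second list it returns is the set of files to pull out and clean. The "memberID" and "filename" column names are the part you would change if your listings use different headers.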
As a heads up specifically for those working with the OpenAPS Data Commons, it is best practice to name/describe the version of the dataset via the size. For example, you might be working with the n=88 or n=122 version of the dataset. If you used the above method, you would then describe it along the lines of taking and cleaning the n=122 version; selecting new files available from the n=183 version and appending them to the n=122 version; and the resulting dataset is n=(122+number of new files used).
Folks who access the n=183 version of the dataset and haven’t previously used a smaller version can reference the n=183 version and clarify how many files they ended up using, e.g. describing that they followed X method to clean the data starting from the n=183 version and that their resulting dataset is n=166.
It is important to clarify which version and size of the dataset is being used.
PS – this method works on other data file types, too! You’d change the variable/column header names in the script to update this for other cases.