Data Structure and Quality
There are about 25 gigabytes of data, released by the Washington State Liquor and Cannabis Board, including about 75 million entries in inventory logs and nearly 50 million transactions in retail sales. Presumably in part because of the size, the data is available to the public by request, but not directly on the LCB website. The data is released in 28 tables, without a codebook or a relational diagram. This simplified entity-relationship diagram displays roughly how some of the tables relate to each other.
Data Cleaning
The combination of the size of the files (the largest of which is 13 gigs, with many coming in at 1 to 4 gigs) plus the lack of clarity as to how they relate to each other has posed a challenge. For example, each table generally has its own variable called `id` and the `id` variable in the `inventory` table matches the `inventoryid` variable in other tables, however, other times these connections are not as clear. We have taken several steps to address this.
We have used samples of the data to test our theories about how to join different tables. Then we check certain variables to see if the results make logical sense.
We have reached out to other researchers using the same dataset (and facing the same problems) to share their understanding of the variable names and connections, as well as to the LCB itself.
This has been an iterative process. As we learn more, we're then able to find other relationships between the data, that then also raise new questions. Also as our understanding grows, we often return back to our research and regulator contacts to see if our logic makes sense to them, and to try to understand our next steps.
Our files are available on github.
Data Quality
Complete
The data is complete in so far as it captures every legal transaction in the market. That said, it's possible that there is missingness if there is any product being diverted, lab tests being fudged, etc.
Coherent
Checking for coherency has been a way that we've been checking the accuracy of our joins as we work through these different tables trying to figure out how they relate to each other. Therefore, the data is generally coherent, because major deviations indicate something wrong with how we handled the data. However, there are occasionally some anomalies that don't make sense, for example, non-retail locations showing up in file of retail transactions. These tend to be minor (in this example, three instances out of 50 million transactions).
Correct
Similar to it's completeness, the data is largely correct, but it is not easy to know if there are intentional or unintentional errors. On the whole, given the volume of the number of records, the data seem to be generally correct.
The purpose of these data are to provide tracking and accountability of the system to the state regulatory system. The data is held accountable to all the business owners through the chain -- growers, processors, retailers, and testers.