Data Structure and Quality

The Washington State Liquor and Cannabis Board (LCB) released about 25 gigabytes of data, including about 75 million entries in inventory logs and nearly 60 million retail sales transactions. Presumably in part because of its size, the data is available to the public by request but cannot be downloaded directly from the LCB website.

Understanding Product Names

The dataset variable "productname" is the name given to the product by the retailer. Because the field is entered manually, it contains errors such as spelling mistakes, as well as inconsistencies. For example, some names include weight and some do not, and those that include weight may say "1/2 gram", "1/2 g", "0.5 g" or "0.5 grams". Some names are also overly broad, like "vendor sample", or consist only of a strain name, like "Blue Dream." These inconsistencies and errors mean that product name is an imperfect variable for grouping our data, but it is a strong enough proxy for these purposes.
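The weight variants listed above can be reduced to a single number with a regular expression. The following is a minimal sketch, not the code used for this analysis: the function name `extract_grams` and the pattern are assumptions made for illustration, and the pattern only covers gram units written as "g", "gram" or "grams".

```python
import re
from fractions import Fraction

# A quantity ("1/2", "0.5", ".5" or "1") followed by a gram unit.
# Hypothetical pattern: it does not handle ounces, milligrams or misspelled units.
WEIGHT_RE = re.compile(
    r"(?P<qty>\d+\s*/\s*\d+|\d*\.\d+|\d+)\s*(?:grams?|g)\b",
    re.IGNORECASE,
)


def extract_grams(productname):
    """Return the first gram weight found in a product name, or None."""
    match = WEIGHT_RE.search(productname)
    if match is None:
        return None
    qty = re.sub(r"\s+", "", match.group("qty"))
    if "/" in qty:
        try:
            # Fraction parses strings such as "1/2" exactly.
            return float(Fraction(qty))
        except ZeroDivisionError:
            # A malformed entry such as "1/0 g" carries no usable weight.
            return None
    return float(qty)


for name in ["1/2 gram", "1/2 g", "0.5 g", "0.5 grams", "Blue Dream", "vendor sample"]:
    print(name, "->", extract_grams(name))
```

All four weight spellings from the example map to 0.5, while names without a weight, such as "Blue Dream" or "vendor sample", return None and would still need to be handled separately.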

Data Cleaning

The combination of file size (the largest file is 13 gigabytes, and many are between 1 and 4 gigabytes) and the lack of clarity about how the files relate to each other has posed a challenge. For example, each table generally has its own variable called `id`, and the `id` variable in the `inventory` table matches the `inventoryid` variable in other tables. Other connections are not as clear. We have taken several steps to address this.

We have used samples of the data to test our theories about how to join different tables. We then check certain variables to see whether the results make logical sense.
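One simple way to test a candidate join on a sample is to measure how many rows in one table find a partner in the other. The sketch below is an illustration under stated assumptions: the helper `key_match_rate` and the sample rows are hypothetical, and only the column names `id` and `inventoryid` come from the data described here.

```python
def key_match_rate(child_rows, parent_rows, child_key="inventoryid", parent_key="id"):
    """Share of child rows whose key value appears in the parent table."""
    if not child_rows:
        return None
    parent_ids = {row[parent_key] for row in parent_rows}
    matched = sum(1 for row in child_rows if row.get(child_key) in parent_ids)
    return matched / len(child_rows)


# Hypothetical sample rows standing in for small extracts of two tables.
inventory_sample = [{"id": 101}, {"id": 102}, {"id": 103}]
sales_sample = [
    {"id": 1, "inventoryid": 101},
    {"id": 2, "inventoryid": 102},
    {"id": 3, "inventoryid": 102},
    {"id": 4, "inventoryid": 999},  # no matching inventory row
]

rate = key_match_rate(sales_sample, inventory_sample)
print(f"{rate:.0%} of sampled sales rows match an inventory row")
```

A match rate close to 100% supports the theory that the two columns refer to the same thing. A low rate suggests either the wrong key or a sample that does not overlap, so it is worth checking both before drawing conclusions.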

We have reached out to the LCB itself, and to other researchers using the same dataset (and facing the same problems), to learn how they understand the variable names and connections.

This has been an iterative process. As we learn more, we find other relationships in the data, which in turn raise new questions. As our understanding grows, we often return to our research and regulator contacts to check whether our logic makes sense to them and to work out our next steps.

Our files are available on GitHub.

Assessing the Quality of the Extracts Data

Considering how extensive the dataset is, the overall data quality is very high. However, for any specific analysis, some aspects of the data are of higher quality than others. For the purposes of classifying extracts, we focus our assessment on the specific variables used in this analysis.

Since the database is overseen by the LCB, rules are in place to ensure that the data is Complete, Correct and Coherent. Though there is certainly room for additional oversight, the data is of very good quality.

Complete
Specifically for the extracts classification, approximately 5% of the necessary data is missing. Given the size of the dataset, we consider this share small enough not to affect the analysis.
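A missing-data share of this kind can be computed by counting the rows in which any required field is empty. The sketch below shows the idea on made-up rows; the helper `missing_share` and the field `inventorytype` are hypothetical, and it assumes that a missing value appears as None or a blank string.

```python
def missing_share(rows, required_fields):
    """Share of rows in which at least one required field is None or blank."""
    if not rows:
        return None

    def is_missing(value):
        return value is None or (isinstance(value, str) and value.strip() == "")

    incomplete = sum(
        1
        for row in rows
        if any(is_missing(row.get(field)) for field in required_fields)
    )
    return incomplete / len(rows)


# Hypothetical rows; "productname" comes from the dataset, "inventorytype" is made up.
sample = [
    {"productname": "Blue Dream 1/2 g", "inventorytype": "extract"},
    {"productname": "", "inventorytype": "extract"},
    {"productname": "vendor sample", "inventorytype": None},
    {"productname": "Wax 0.5 g", "inventorytype": "extract"},
]

share = missing_share(sample, ["productname", "inventorytype"])
print(f"{share:.0%} of sample rows are missing a required field")
```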

Coherent
Checking for coherence has also been our way of checking joins as we work out how the different tables relate to each other. The data is coherent overall. However, there are occasional anomalies that do not make sense, for example, non-retail locations showing up in the file of retail transactions. These tend to be minor (in this example, three instances out of 50 million transactions).
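A coherence check like the retail-location example can be expressed as a filter that flags transactions whose location is not a known retailer. This is a sketch only: the field name `location`, the lookup `location_types`, and the label "retailer" are hypothetical stand-ins, not the actual schema.

```python
def non_retail_transactions(transactions, location_types, retail_label="retailer"):
    """Return transactions whose location is not labelled as a retail location.

    Locations absent from the lookup are flagged as well, since they cannot
    be confirmed as retailers.
    """
    return [
        row
        for row in transactions
        if location_types.get(row["location"]) != retail_label
    ]


# Hypothetical lookup from location id to license type.
location_types = {1: "retailer", 2: "retailer", 3: "producer"}
transactions = [
    {"id": 10, "location": 1},
    {"id": 11, "location": 3},  # producer appearing in retail sales
    {"id": 12, "location": 2},
    {"id": 13, "location": 4},  # location missing from the lookup
]

anomalies = non_retail_transactions(transactions, location_types)
print(len(anomalies), "anomalous rows out of", len(transactions))
```

Reporting the count alongside the total, as in the three-out-of-50-million figure above, makes it easier to judge whether an anomaly is worth investigating.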

Correct
As with completeness, the data is largely correct. However, it is not easy to know whether there are intentional or unintentional errors. On the whole, given the volume of records, the data seems to be generally correct.

Accountability
The entire purpose of this dataset is to create accountability in the legal marijuana market. The LCB has established rules to ensure that these data accurately monitor the transfer of products through the supply chain. That said, data are entered into the database by different parties invested in the success of the market. Invested parties are responsible for entering accurate data into the database and the LCB is responsible for ensuring that these data are correct. Thus, interacting with the data is a continuous exercise in improving accountability.