Wednesday, July 30, 2014

cc0 vs. the world

Today I had some discussion that stirred up my desire to say, in a loud voice: "But look here, if it is credit you want, then the best way to get it is if your data are cc0, it's even better than cc-by!" So I tweeted:
TL;DR There are lots of hints, but apparently no direct studies that address this.

Two important bits of clarification about my original thoughts. I was only interested in cc-licensed data. I was not asking whether cc0 data are reused more than other cc-licensed data, just whether cc0 data get *cited* (yes, citations = bad metric, so use a generic "pointed to" perhaps) more than other cc-licensed data, particularly cc-by.

The basic premise is that the best way to (ultimately) bring focus to your work is to make it completely free, and that this will bring more attention, in the long run, than requiring attribution.

"Unethical" people will use open data regardless of the license, however they want, so they are wash, and it follows that we can eliminate them from the conversation. If this premise is accepted (it is just a premise) then any possible mechanism that causes someone (i.e. an "ethical" person) to pause before they use a dataset will result in that dataset being less widely cited. Cc-by, however innocuous, is a mechanism that will cause some to pause. I'm not claiming this scenario as my idea, it is straightforward enough that many have thought of it. What's curious is that it seems that perhaps no-one has tested it explicitly.

Many thanks to all who responded with insights (see the conversation by clicking on the tweet). Here is a list of tweeted links for future reference:


  1. > "Unethical" people will use open data regardless of the license, however they want, so they are wash, and it follows that we can eliminate them from the conversation

    Hmmm... I appreciate this discussion is intended, I presume, for the context of academia, but outside that domain I really don't agree with your statement there.

    e.g. for open government data they're usually delighted that data gets used. It really doesn't always matter if the use is 'cited' or not - it's meant to be used & use is good, regardless. That's why the data is openly provided - to be used, not to be cited.

    I think there's also an interesting discussion to be had over the subtle difference between rigorous and detailed acknowledgement of the provenance of data & 'citation' -- they're not the same thing in my mind. Scientifically, a super detailed methodology of the provenance & handling/transformation of the data used is probably more *useful* in terms of reproducibility than a simple 'citation' (in terms of what I'm thinking of). I find citations rather constraining in terms of what can be expressed in the reference list I guess. That's not to say you can't have both, but just recognising that citation is a pretty crude & imperfect mechanism for credit-giving.

  2. No disagreement here. I was indeed trying to keep things narrowly focused, mostly for pragmatic reasons. There certainly won't be one parameter (citations) that allows us to point at all the downstream consequences of open data.

    I chose citations as a proxy for a broader concept of linkages, or pointers. In my thinking, citations, or any linking mechanisms such as provenance chains, are a sort of "gravitational" mechanism that draws attention (and other data) towards some information (open data). Data with a lot of "gravity", regardless of whether they are good or bad, should draw in other resources (e.g. users of the software that produced that data, grant funding, public scrutiny, critical reviewers, repeat experiments, etc.). This seems like a reasonable hypothesis (and again, I'm not claiming to have come up with it), but actually testing it requires some proxies to start with, like the old-school concept of citations.