University of North Carolina at Chapel Hill
- Posts: 2
- Joined: Mon Apr 18, 2016 9:23 am
There are many things to like in this report. The report conceives content analysis as a series of interconnected "decisions" and provides perceptive suggestions about best practice.
I think it would be worth considering one extension of this report and one refinement.
The extension is that transparency in data production should, I think, extend (a) up to how the concept is specified and (b) down to how ambiguities in coding are adjudicated. Estimating a political concept involves a series of theoretical, conceptual, operational, and coding decisions, each of which should be open to the user.
1) Specifying the measurement concept. Which meanings does one wish to include?
2) Unfolding the concept into dimensions. How does one break down the measurement concept into discrete pieces that can be independently assessed and aggregated to capture its meaning?
3) Operationalizing dimensions. How does one conceptualize and specify intervals on the dimensions? What rules allow one to reliably detect variation across intervals?
4) Scoring cases. What information does one use to score cases? Where is that information, and how can others gain access to it?
5) Adjudicating scores. How does one interpret gray cases, i.e., cases for which scoring involves interpretation of a rule?
Making each of these decisions transparent is a form of self-help for scholars who manually code qualitative source materials, to the extent that it allows users to evaluate the assumptions that underpin a particular dataset. So one might extend norms about transparency "up" to the conceptual/theoretical decisions that frame the concept and "down" to gray cases. Perhaps a litmus test of transparency in data production is a discussion of the ambiguous cases that inevitably occur in any dataset.
A refinement might be to modulate the advice about when to release the full dataset. I was frequently asked to share the coding scheme for data I was actively producing, before I had even published! However, it can make sense to publish a dataset and the accompanying materials much earlier than two or three years after its completion. If one wishes to generate interest in a new field, and if one's data has multiple uses, it may make sense to publish the dataset as a stand-alone article or book. The only way to generate feedback on a dataset is to let others see inside.
Finally, thanks to the authors for writing a thoughtful and balanced report.
Thank you for the guidelines; I have only a few comments.
1. The decreasing price of online storage, and especially the availability of repositories, should be weighed against the argument that digitized raw material is expensive to maintain online. Some types of files, such as scans, are large, but everything that is not contractually protected should be accessible upon request.
2. I find that releasing internal deliberative processes is unrealistic, and, as the guidelines note, the potential drawbacks outweigh the benefits. I do not agree, however, with how the report pictures one of these disadvantages. You write that, anticipating later scrutiny, researchers would opt to code more conservatively or in a less disaggregated manner. I do not think this is an important disadvantage; on the contrary, it may often be a desirable effect. The increasing emphasis on effect size and robustness in our research also discourages inflating variation in our concepts to be exploited statistically, and this is a good thing. Of course, academic culture should also accept that the consequences are more negative findings and more uncertainty. Additionally, coding conservatively and achieving intercoder reliability should not be seen as necessarily detrimental.
2.5 Preparing and maintaining coding sheets in which all decisions are recorded is good practice. However, in long projects the format and information contained in them may evolve, and from the first to the last they might end up looking quite different. Having to re-format them all for later publication is an effort that I do not deem worth the benefit. Expectations about the consistency of these documents should be kept in check.
3. Finally, it seems that the guidelines value access to original material and coding notes more for the general public than for journals and reviewers. My impression (as one of the least experienced researchers in the discussion) is that reviewers might be interested in the sources just as much as, or even more than, the larger research community.
Thanks for producing this report.
I have two reflections.
1) It seems right to recognize the right of journals to demand coded data. For large amounts of text, however, this can be very hard to accommodate in the main manuscript. Some journals appear to recognize that they need to host space for this sort of replication data on their websites (with extensive online appendices or supplementary material). However, some journals do not provide this space when the material is qualitative. I would suggest recognizing that journals should work to host online space for data that is not quantitative as well. Otherwise, those of us who use primarily qualitative material will be forced to put such replication data in appendices in our articles, eating up valuable word space.
2) On the first main recommendation, which states that "scholars make their source material accessible" unless copyright or confidentiality prevents it, I would suggest that another exception might be merited. Specifically, if the material is already publicly accessible, it seems a misuse of resources to expect researchers to make it accessible again. For example, I have been working with judicial decisions, all of which are hosted on courts' websites. Is it really necessary for me to download all of the courts' judgments from this public space only to upload them again on another website? So long as I adequately document where these decisions can be found, it is not clear to me what added value I would be providing.
1. Specify how many comments, etc. were used -- transparency about the sources of this report.
2. "Also, the content under consideration in this domain is largely historical and public (though not always), and some other concerns, such as those involving human subjects, are not as challenging" <rewrite this sentence - at first, it reads as if human subjects research is not challenging. Obviously not what you meant but it requires careful reading to realize that.
2. Also, here, lay out WHY certain issues are challenging and others aren't. How do "public" documents make things easier? How do human subjects make things harder?
3. "decreasing order of what we might think of as their level of cost" I am not sure there is any reason to rank these. Just lay them out.
4. "seems reasonable that those who employ multiple coders and redundant coding might want to report rates of intercoder correspondence" You make a judgment on releasing this type of info but not others. Why only this type of info? I think not making a judgment on any would be better -- just discuss the costs/benefits of releasing each, even if briefly. What are the primary tradeoffs of transparency vs. privacy for each?
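For what it's worth, rates of intercoder correspondence are cheap to compute once redundant codings exist. Below is a minimal sketch (the category labels and codings are hypothetical) of raw agreement alongside Cohen's kappa, which corrects for chance agreement:

```python
# Minimal sketch: raw agreement rate and Cohen's kappa for two coders.
# The category labels and codings below are hypothetical examples.
from collections import Counter

def agreement_and_kappa(coder_a, coder_b):
    """Return (raw agreement rate, Cohen's kappa) for two equal-length codings."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    # Share of units on which the two coders assigned the same category.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's marginal category frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

coder_1 = ["pos", "pos", "neg", "neu", "neg", "pos"]
coder_2 = ["pos", "neg", "neg", "neu", "neg", "pos"]
obs, kappa = agreement_and_kappa(coder_1, coder_2)
```

Reporting kappa next to the raw rate is usually more informative, since raw agreement can be high by chance alone when one category dominates.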
4ff. I think the journal/scholarly community distinction is a good one, but it's one that is buried here. I would make this a structuring element of the report.
5. Assessment of benefits. It would be nice to put a placeholder in here about the corresponding responsibility of users of data to prominently cite and acknowledge THEIR benefit from using someone else's data. I am somewhat unconvinced by the claim that "While these benefits are collective benefits, many of them redound to the individual researcher as well" -- especially if a) others don't cite the dataset clearly and acknowledge that their research would not be possible without it and b) the costs of producing the dataset are large.
5. "It takes time and resources to prepare, upload, and maintain documentation and data from a content analysis. In this respect, expanding transparency by putting source material, coded data, etc. online ought to be balanced against the time and effort this requires." My experience is that most of the time and resources go into creating the dataset, not into making it public.
6. Explain how QDR relates to the more well-known ICPSR.
6-7. The real value of something like QDR is if QDR provided the resources to make datasets public. The resources to do so are large and collaboration could shift some of those resource demands off the scholar who created the data and, thereby, make transparency more likely.
7. "deliberate decision not to share this material online or is a remainder from the days when source material was physically preserved in offices, basements, and archives" OR simply avoiding the ADDITIONAL resource costs of making the dataset transparent.
8. Another cost that might be mentioned is that the quality of documentation needed to make coding decisions interpretable to the original researcher 2 years later is far lower than that needed to make them interpretable to others. My own notes on coding are solid but messy. The time it would take to make them non-messy would be enormous and that is a large deterrent to transparency for me and, I imagine, for other researchers.
8-9 Coded data section. I think an important suggestion is that USERS of data must be encouraged to provide HIGH VISIBILITY (first footnote, perhaps) citation to all datasets used. Too often, I see work that uses the IEADB and either doesn't cite it at all or mentions it in the text but not in the bibliography or buries it in footnote 32. That is annoying and really decreases the desire to provide transparency. There IS an obligation on the part of USERS that is not, in my experience, being met.
- Posts: 1
- Joined: Fri Nov 17, 2017 5:29 am
1. Your report does a great job at identifying the benefits and costs generated from increased data transparency. However, in discussing the tradeoff, I wonder if you sometimes place a little too much emphasis on the costs, and on the interests of individual researchers? Naturally, the interests of individual researchers and the interests of the field cannot be separated, precisely for the kind of general equilibrium effects that you discuss, but if we are striving for improvement in data transparency, that will necessarily imply changes to current practice, and hence, to incentive structures. Such change may be costly, but arguably some of the costs would decrease over time, as practices among researchers change and tools and standards for greater transparency develop. For example, it appears to me that many of the costs (and especially (2) and (3) under Q3) would be reduced by adequate training and institutionalization of repositories, so that scholars would simply think of transparency as the “standard way” of doing research, much like we think of some of the transparency practices that already exist today.
2. I agree with the comment by Jofre on how anticipation of scrutiny would have implications for how scholars code their data. On balance, would such discipline not be a good thing?
3. Your recommendation is that journal requirements for transparency and data-sharing should remain relatively limited. A question on incentives: Will sound transparency practices develop without additional requirements from journals? I don’t have a full historical understanding, but it appears to me that journals - and their requirements for replication data, etc. - have played a central role in promoting the practices of transparency that are taking hold in quantitative research. And they have played this role precisely because publication in journals is so closely linked to scholars’ incentives. In the absence of such incentives, the optimistic scenario is that individual scholars (such as those mentioned in your report) would take leadership on transparency, that others follow, and that it eventually becomes increasingly suspect not to share your material publicly. This may happen, and perhaps it is already underway. But given that researchers have remained so reluctant to share material in the past, how much trust can we place in this shift to transparency happening in the absence of clear incentives?
4. As I read the report, the main barrier to expansive journal requirements is “intellectual property rights”. I fully sympathize with this point. But are there not solutions that would allow us to protect these “intellectual property rights” while simultaneously promoting the research discipline that stems from anticipation of future scrutiny? Are there practices that may allow wider sharing of material at the time of submission without incurring some of the costs listed under Q3? E.g., could one envision that a researcher, in submitting a manuscript to a journal, also submits a wide range of materials, but that these remain “under lock” for a certain period of time (e.g., 2-3 years)? This would protect the “intellectual property rights” but simultaneously discipline the researcher to adhere to some of the values promoted by transparency (i.e., making interpretations that are well-founded in data).
Please excuse me for being late to the party, I have just learned that it is going on… Your work on the report is very much appreciated! It clearly pinpoints the major issues I face when conducting, but also when teaching, (human-coded) content analysis. In fact, I would also subscribe to most of your explicit suggestions for best practice.
Given that this document could stimulate a valuable debate, however, I also want to share some more critical but hopefully constructive reading impressions.
1) Benefits and costs of transparency
In the introductory parts (mainly Q2 and Q3, pp. 5-6) you draw a picture in which the major benefits of transparency accrue mainly to the scholarly community while the individual researcher primarily bears the costs. Beyond this dichotomy, I would suggest also stressing the individual benefits of transparency more strongly here.
A) It is briefly mentioned further down in the text, but one major individual benefit of research transparency that could be stressed in Q2 is enhanced scholarly recognition (probably even beyond the particular sub-discipline). Publishing data does increase citation counts, and many content analysis projects will start by establishing reference points regarding the coding of relevant concepts, which will ideally be reflected in corresponding references as well. I guess the issue here is making sure that credit is paid where credit is due. Two ideas on this:
i. Use repositories that establish a durable citation (e.g. via DOIs or hashes in the Harvard Dataverse) and make using this citation an obligation in the discipline.
ii. Push for a revival of “research notes” in major journals, which allow publications that ‘only’ present datasets.
B) Likewise, I feel a bit uneasy about presenting the time and resources it takes to prepare and maintain documentation as an individual cost only. I would rather argue that good project organization is positively related to the quality of the resulting research. In this view, knowing that a data collection can be screened by somebody else actually helps researchers organize well right from the start. Documenting individual decisions early is not only a burden, but actually helpful. In that sense, like Joffre, I am not so sure that the incentives you describe in Q2(5) are really that ‘perverse’ after all.
2) “Intellectual property rights” and quarantine periods
I do see the argument that data collectors should be able to reap the rewards of their efforts. On the one hand, this also pushes me to support the research notes and citable repositories mentioned above. On the other hand, though, the three-year period appears somewhat arbitrary. From a replicability standpoint, the first substantial research publication would be the ‘natural’ time point for data publication. The risk that other researchers would then come up with exactly the same ideas for data usage as the researcher who collected the data does not seem particularly high from my perspective.
3) Copyright issues
From my experience, this is one elephant in the room. Through the big-data hype, many providers of qualitative data (e.g. newspaper publishing houses) have learned about the treasures they control and try to monetize them by raising access and usage barriers. Given cross-national variation in copyright law and fair-use rules, I often face high legal uncertainty in publishing source materials. Typical questions that arise: Which jurisdiction applies (the provider's or the user's)? Which parts of the material can possibly be published (e.g. full newspaper articles vs. lower-level coding units such as sentences or paragraphs)? There are clearly political tasks to tackle here, but if the debate is to result in common scholarly transparency standards, the unequal distribution of such challenges has to be acknowledged as well.
4) Minor comments
I am probably missing something here, but I do not understand the distinction between coding decisions (8) and code books (9). Shouldn’t the latter contain everything you need to understand the former? In discussions with Alex from the Authority Project, I see that there might be choices that cannot be readily generalized. But if this exercise is about standard-setting, then a clear line between such choices and the codebook has to be drawn, imho.
One for the wish list: If possible in light of copyright issues, I would like to strongly promote the storage of qualitative material and codes in common data frames or hermeneutic units. This is a necessary condition for eventually training machine learning algorithms on the often tremendous practical and conceptual efforts of human coding projects. This is not about replacing such efforts, but rather about amplifying the work of the original data collectors to other domains (which, in turn, should be measurable in increased scholarly recognition).
Thanks again for your work on this and for pushing the debate forward!