The manipulation of data in science: challenges of assessing results obtained through quantitative and qualitative methods

Geoffrey Brian West,
Theoretical physicist, former president and distinguished professor of the Santa Fe Institute.

“If I have seen further, it is by standing upon the shoulders of giants.”
Isaac Newton

There are many aspects, approaches, tools, and even hints in data analysis and representation that one might later characterize as ‘manipulation of data’. This piece, based on an online panel discussion at the conference “Challenges of Source Evaluation in Science and Correlated Areas,” focuses on the main reasons and prevailing circumstances that lead to data manipulation. From Professor Geoffrey West’s perspective, the problem has two main parts: access to data and the credibility of that data.

For many years I conducted my research in high-energy physics, with due access to data from organizations that could be described metaphorically as ‘huge scientific accelerators’, for instance CERN in Geneva, Switzerland. In this case a researcher faces certain barriers, since one deals with very specific experimental data. Typically, in high-energy physics one has no realistic possibility of re-running the experiment, a privilege one has in some other traditional scientific disciplines. I believe that being able to confirm theories, predictions, hypotheses, and the results of analyses is crucial for scientific progress. Instead, an enormous dimension of trust arises: one trusts that whatever a group of a thousand experimentalists produced together is correct data. Sometimes such experimental groups declare: “This is what we measured. This is the truth. The rest have to trust us.” And they have manipulated the data, because all kinds of corrections are made during the research process; not infrequently, many slight manipulations are applied to fit ‘the result’ into a ‘common form’ that other researchers can use. Nevertheless, one has to trust even that sort of data. Naturally, the level of credence depends on the reputation and scientific profile built up over many years. This is one pole of the problem.

There is also an opposite pole. I have collaborated with companies and social organizations and faced a problem of a very different nature: the proprietorship of data. Either the data exists but one cannot get access to it, or one has to pay huge amounts of money to obtain it. Even if one pays, that data has again been ‘marinated’ to a certain extent: it is not the pure data one seeks but a ‘manipulated’ version, as with data derived from companies’ tax returns. The overall tendency is that great confidence is placed in data that is frequently not verified, which is an enormous problem.

An even more striking difficulty arises when one needs to comprehend an organization as a system or ‘living organism’, that is, to understand what ‘goes on’ inside it. Then one more problem occurs, which can be summed up by the question: ‘What is the internal data of a specific company or group of companies?’ Generally, one never gets access to that ‘internal data’ box. Some companies send a researcher documents analogous to organizational charts, which are idealized versions of what the company is. These do not reflect reality; they do not show the communication system, the interchangeability systems, and so forth. Overall, this is a problematic field, since the data has been manipulated in some form or another. I have not come across any verified methodology or set of data assessment tools for getting around these extremes, since organizations like the ones mentioned above are under no obligation to supply raw data to scientists or other interested parties (even politicians). Consequently, this is a significant issue, particularly in the social sciences. In the physical sciences I would say it is less of a problem; it does become an issue in some of the biological, medical, and pharmaceutical sciences because of the inevitable role of money, payoff, and who got there first. Associated with this are the attempts of federal agencies across the globe, when they support research, to insist that researchers make their data available; otherwise there is no transparency at all. However, many researchers ignore this requirement. This is a truly unsolved problem, since scientists and researchers are being pushed to use supplied data that is sometimes not credible.

Overall, these are the most crucial aspects of data manipulation and the problems to be solved. I believe there is a solution to these problems, and it can be achieved mainly through the cooperative efforts of members of the scientific and academic communities. Moreover, I am concerned that the same aspects and analogous problems lie behind the problem of the accuracy and credibility of Wikipedia data (the same question concerns other online encyclopedias). The environment dictates its terms and brings changes; these days, people operate primarily through the web. We all have time constraints as well as access constraints. I am concerned about long-term access to both the raw data that a researcher, a company, or a politician could have access to, and the set of methods and approaches that helps one verify the data and ensure its credibility.

I am equally concerned that some data has traditionally been manipulated. By this I mean manipulated not in a negative sense but in a positive one: data presented to the scientific audience in a form that any researcher can use effectively. Nevertheless, it has to be acknowledged that scholars frequently receive, and even produce, data that has been ‘modified’ for whoever deals with the research results afterward. This is done in terms of their research, their proprietorship, and their claims to originality. Finally, I suggest that if these problems of manipulated data and sources of scientific information are not resolved now, they will only worsen the overall state of science in the future.
