Ethical issues involving human subjects

Creswell (2013) states that ethical issues can arise at every phase of a study: prior to the study, at its beginning, and during data collection, analysis, and reporting. Since we deal with data from people about people, we as researchers must protect our participants and promote the integrity of the research by guarding against misconduct and against misrepresenting the data. Because we work with people, we are obligated to ensure that interviewees are not harmed as a result of our research (Rubin, 2012). The following anticipated risks are drawn from Creswell (2013) and Rubin (2012):

  • Prior to conducting the study
    • We must obtain Institutional Review Board (IRB) approval before we conduct the study.
    • We must gain local permission to conduct the study, both from the agency, organization, or corporation in which the study will take place and from the participants themselves.
  • Beginning the study
    • We will not pressure participants to sign consent forms. To achieve high participation rates, the purpose of the study must be compelling enough that participants see it as a value-added experience for themselves as well as for the field of study.
      • We should also conduct an informal needs assessment to ensure that the participants’ needs are addressed in the study, which also supports a high participation rate.
      • However, we will tell the participants that they have the right not to sign the consent form.
  • Collecting data
    • Respect the site and keep disruption to a minimum, especially when conducting observations. The goal of observation in this study is not to be an active participant but to take field notes on key interactions that occur while the participants go about their work.
    • Make sure that all participants in the study receive the same treatment, to avoid data quality issues during collection.
    • Be respectful of and straightforward with the participants.
    • Discussing the purpose of the study and how the data will be used is key to establishing trust with participants, and it allows them to start thinking about the topic of the study. This can be accomplished by sending them an email prior to the interview stating the purpose of the study and the time we are requesting of them.
    • When asking interview questions, avoid leading questions. That is why questions may be asked in a particular order; in some cases, questions build on one another.
    • Avoid sharing personal impressions. Since we know what the final questions in the interview are, we should ask our questions without giving any indication of what we are looking for, so that participants do not end up contaminating our data.
    • Avoid disclosing sensitive or proprietary information.
  • Analyzing data
    • Avoid disclosing only one set of results; we must report multiple perspectives and contrary findings.
    • Protect the privacy of the participants by ensuring that names and any other identifying indicators have been removed from the results.
    • Honor promises: if we offer participants a chance to read and correct their interviews, we should do so as soon as possible after the interview.
  • Reporting, sharing, and storing data
    • Avoid situations where there is a temptation to falsify evidence, data, findings, or conclusions. This can be supported by using unbiased language appropriate for the audience.
    • Avoid disclosing information that could harm the specialists.
    • Keep the data in a shareable format, with the privacy of the specialists as the main priority, while retaining the raw data and other materials for five years in a secure location. This material should include complete proof of compliance (IRB approval, absence of conflict of interest) for if and when it is requested.

References:

Adv Topics: Extracting Knowledge from Big Data

The evolution of data into wisdom is described by the DIKW pyramid: Data is just facts without any context, but when facts are used to understand relationships, they generate Information (Ahlemeyer-Stubbe & Coleman, 2014). When that information is used to understand patterns, it builds Knowledge, and when that knowledge is used to understand principles, it builds Wisdom (Ahlemeyer-Stubbe & Coleman, 2014; Bellinger, Castro, & Mills, n.d.). What enables the jump from one level of the DIKW pyramid to the next is an appreciation of learning “why” (Bellinger et al., n.d.). Big data, a term first coined in a Gartner blog post, is data that has high volume, variety, and velocity; without an interest in understanding that data, data scientists will lack context (Ahlemeyer-Stubbe & Coleman, 2014; Bellinger et al., n.d.; Laney, 2001). Applying the DIKW pyramid can therefore help turn big data into extensive knowledge (Ahlemeyer-Stubbe & Coleman, 2014; Bellinger et al., n.d.; Sakr, 2014). Extensive knowledge is derived by assigning meaning to big data, usually through predictive analytics algorithms (Sakr, 2014).
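The DIKW progression can be sketched with a toy example (all figures hypothetical, not from the cited sources): raw facts become information when tied to a context, and a pattern found in that information becomes knowledge that can guide action.

```python
# Data: raw daily sales figures, facts with no context (hypothetical numbers).
data = [120, 135, 90, 150, 160, 80, 75]

# Information: the same facts related to a context, the day of the week.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
information = dict(zip(days, data))

# Knowledge: a pattern discovered in the information.
weekday_avg = sum(information[d] for d in days[:5]) / 5   # 131.0
weekend_avg = sum(information[d] for d in days[5:]) / 2   # 77.5
weekdays_stronger = weekday_avg > weekend_avg             # True

# Wisdom: a principle acted upon, e.g. stocking more inventory on weekdays,
# comes from understanding *why* the pattern holds.
print(weekday_avg, weekend_avg, weekdays_stronger)
```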

Machine learning requires historical data and belongs to the data mining stage of the data analytics process, where it is used to uncover hidden patterns or structures within the data (Ahlemeyer-Stubbe & Coleman, 2014). Machine learning models are easier to build and maintain than other classical data mining techniques (Wollan, Smith, & Zhou, 2010). Machine learning algorithms include clustering, classification, and association rules techniques, and the right algorithm among these must be selected to meet the needs of the data (Services, 2015). Unsupervised machine learning techniques such as clustering are used when data scientists cannot classify the data ahead of time and want to understand hidden structures within the data set (Brownlee, 2016; Services, 2015). Supervised machine learning involves model training and model testing to understand how input variables relate to an output variable, using techniques such as classification and regression (Brownlee, 2016).
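The contrast between the two families can be sketched in plain Python (a toy one-dimensional example, not from any of the cited sources): k-means clustering groups unlabeled points, while a nearest-centroid classifier learns from examples with known labels.

```python
from collections import defaultdict

def kmeans_1d(points, k=2, iters=10):
    """Unsupervised: group unlabeled points around k centroids."""
    centroids = points[:k]  # naive initialization with the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

def nearest_centroid_predict(train, x):
    """Supervised: predict the label whose training centroid is closest to x."""
    groups = defaultdict(list)
    for value, label in train:      # training data comes with known labels
        groups[label].append(value)
    centroids = {lab: sum(vs) / len(vs) for lab, vs in groups.items()}
    return min(centroids, key=lambda lab: abs(x - centroids[lab]))

# Clustering discovers two hidden groups in unlabeled measurements.
print(kmeans_1d([1, 2, 3, 10, 11, 12]))     # [2.0, 11.0]

# Classification is trained on labeled examples, then predicts for new data.
train = [(1, "low"), (2, "low"), (11, "high"), (12, "high")]
print(nearest_centroid_predict(train, 3))   # low
```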

An example of an open-source Hadoop machine learning library is Apache Mahout, which can be found at http://mahout.apache.org (Lublinsky, Smith, & Yakubovich, 2013). One limitation of learning from historical data to predict the future is that it can “stifle innovation and imagination” (Ahlemeyer-Stubbe & Coleman, 2014). Another limitation is that current algorithms may not run on distributed database systems, so some tailoring of the algorithms may be needed (Services, 2015). The future of machine learning involves its algorithms becoming more interactive with the end user, an approach known as active learning (Wollan, Smith, & Zhou, 2010).

Case Study: Machine learning, medical diagnosis, and biomedical engineering research – commentary (Foster, Koprowski, & Skufca, 2014)

The authors created a synthetic training data set to simulate a typical medical classification problem, randomly assigning people to healthy and ill classes and assigning random numbers to 10 health variables. Given this construction, the true classification accuracy should be 50%, no better than pure chance. The authors found that when classification machine learning algorithms are misapplied, they can produce false results: it took a sample of only about 50 people before their model’s reported accuracy settled to the level of pure chance alone. Thus, the authors of this paper were warning the medical field that misapplying classification techniques can lead to overfitting.
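The failure mode the authors describe can be reproduced in miniature (a stdlib sketch, not their MATLAB code): random features and coin-flip labels carry no signal, yet evaluating a 1-nearest-neighbour classifier on its own training data reports perfect accuracy, while a held-out test set reveals performance near chance.

```python
import random

random.seed(0)
n, n_vars = 50, 10                    # 50 "patients", 10 random health variables
X = [[random.random() for _ in range(n_vars)] for _ in range(n)]
y = [random.choice([0, 1]) for _ in range(n)]   # labels assigned by coin flip

def predict_1nn(X_train, y_train, x):
    # 1-nearest-neighbour: return the label of the closest training point.
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    return y_train[dists.index(min(dists))]

# Misapplication: evaluating on the training data itself.
train_acc = sum(predict_1nn(X, y, x) == lab for x, lab in zip(X, y)) / n
print(train_acc)   # 1.0 -- every point is its own nearest neighbour

# Proper protocol: hold out data the model has never seen;
# the reported accuracy then hovers around the 50% expected from chance.
X_tr, y_tr, X_te, y_te = X[:40], y[:40], X[40:], y[40:]
test_acc = sum(predict_1nn(X_tr, y_tr, x) == lab
               for x, lab in zip(X_te, y_te)) / len(X_te)
```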

The authors then examined feature selection for classifying Hashimoto’s disease, using clinical ultrasound data from 250 people with the disease and 250 healthy people. Ten variables were selected to help classify these images, and a MATLAB machine learning algorithm was trained on 400 people (200 healthy and 200 ill) and then tested on 100 people (50 healthy and 50 ill). They showed that using 3-4 variables produced better classification results; those 3-4 variables carried large information gain. Such results can mislead practitioners, because a small data set may be generalized too broadly and because the training and testing data sets may lack independence. The authors argued that larger data sets are needed to eliminate some of the issues that result from the misapplication of classifiers.

The authors have the following four recommendations when considering the use of supervised machine learning classification algorithms:

    1. Clearly state the purpose of the study and begin from an understanding of the problem and its applications.
    2. Minimize the number of variables used in classifiers, for example by using pruning algorithms that select only the variables meeting a certain level of information gain. This is more important with smaller data sets than with big data.
    3. Understand that classifiers are sensitive and that results gained from one set of instances might require further adjustments to be implemented elsewhere.
    4. Classification algorithms and data mining are part of the experimental process, not the answer to all problems.
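Recommendation 2 can be illustrated with an entropy-based information-gain check (a sketch with made-up numbers, not the authors’ code): a variable that cleanly separates the classes has high gain and is kept, while a noise variable has none and would be pruned.

```python
import math

def entropy(labels):
    """Shannon entropy of a binary label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(0), labels.count(1)) if c)

def info_gain(values, labels, threshold):
    """Reduction in entropy from splitting the data on value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return entropy(labels) - (len(left) / n * entropy(left)
                              + len(right) / n * entropy(right))

labels = [0, 0, 0, 0, 1, 1, 1, 1]           # 4 healthy (0), 4 ill (1)
informative = [1, 2, 3, 4, 8, 9, 10, 11]    # cleanly separates the classes
noise = [1, 9, 2, 10, 3, 8, 4, 11]          # no relation to the labels

print(info_gain(informative, labels, 5))    # 1.0 bit: keep this variable
print(info_gain(noise, labels, 5))          # 0.0 bits: prune this variable
```

A pruning step would then discard every variable whose gain falls below a chosen threshold, shrinking the classifier exactly as the recommendation suggests.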

Resources: