Using R and Spark for healthcare

Use of R in the healthcare field, a case study by Pereira and Noronha (2016):

R and RStudio were used to examine patient health and disease records stored in Electronic Medical Records (EMR) for fraud detection. The anomaly detection revolves around three pieces of code. First, mapper code filters the data by geo-location. Second, reducer code aggregates the data by the extreme values of cost claims per disease and calculates the difference between them. Finally, a third piece of code analyzes the data that meet a 60% cost fraud threshold. The authors found that as the geo-location resolution increased, the number of anomalies detected also increased.
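The map, reduce, and threshold steps can be sketched in a few lines. The authors worked in R, so the Python below is only a minimal illustration, not their code. The records, the field layout, and the function names are hypothetical. The way the 60% threshold is applied (flag a disease when the spread between its extreme claims exceeds 60% of its largest claim) is an assumption, since the summary above does not spell out the exact rule.

```python
from collections import defaultdict

# Hypothetical claim records: (region, disease, claimed cost).
claims = [
    ("north", "flu", 100.0),
    ("north", "flu", 110.0),
    ("north", "flu", 400.0),
    ("north", "asthma", 200.0),
    ("north", "asthma", 230.0),
    ("south", "flu", 90.0),
    ("south", "flu", 95.0),
]


def map_claims(records, region):
    """Map step: keep one geo-location and emit (disease, cost) pairs."""
    for rec_region, disease, cost in records:
        if rec_region == region:
            yield disease, cost


def reduce_claims(pairs):
    """Reduce step: per disease, find the extreme costs and their difference."""
    grouped = defaultdict(list)
    for disease, cost in pairs:
        grouped[disease].append(cost)
    return {
        disease: {
            "min": min(costs),
            "max": max(costs),
            "diff": max(costs) - min(costs),
        }
        for disease, costs in grouped.items()
    }


def flag_anomalies(summary, threshold=0.60):
    """Assumed rule: flag a disease when diff exceeds threshold * max cost."""
    return sorted(
        disease
        for disease, stats in summary.items()
        if stats["diff"] > threshold * stats["max"]
    )


summary = reduce_claims(map_claims(claims, "north"))
flagged = flag_anomalies(summary)
print(flagged)
```

Filtering on a finer region (a city instead of a state, say) changes which claims are grouped together, which is one way the geo-location resolution can affect how many anomalies surface.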

R and RStudio were also used to apply big data analytics to predict diabetes, based on symptoms, from the Health Information System (HIS), which houses patient information. To predict diabetes, the authors used a classification algorithm (a decision tree) with a 70%-30% training-test dataset split, and then plotted the false positive rate against the true positive rate (an ROC curve). The resulting plot showed skill in predicting diabetes.
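To make the evaluation concrete, here is a minimal Python sketch of a 70%-30% split and of how the false positive rate and true positive rate are computed for the plot. It does not train a decision tree. The labels and scores are hypothetical stand-ins for what a fitted classifier would output on the test set, and all function names are illustrative. The sketch assumes the test set holds at least one positive and one negative case.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold, strictest threshold first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical patient IDs, split 70%-30%.
patient_ids = list(range(10))
train_ids, test_ids = train_test_split(patient_ids)
print(len(train_ids), len(test_ids))

# Hypothetical test-set labels (1 = diabetic) and classifier scores.
test_labels = [1, 1, 0, 1, 0, 0]
test_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
curve = roc_points(test_labels, test_scores)
area = auc(curve)
print(curve)
print(area)
```

A curve that bows toward the top-left corner, with an area well above the 0.5 of a random guess, is what "skill" means in this kind of plot. For this toy data the area is 8/9.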

Use of Spark in the healthcare field, a case study by Pita et al. (2015):

Data quality in healthcare is often poor, particularly in data from the Brazilian Public Health System. Spark was used for data processing to improve quality through deterministic and probabilistic record linking across multiple databases. Record linking is a technique that uses common attributes across multiple databases to identify a 1-to-1 match. Spark workflows were created to do the record linking in four stages: (1) analyzing all the data in each database to find common attributes with a high probability of linkage; (2) pre-processing, where the data are transformed, anonymized, and cleaned into a single format so that all the attributes can be compared to each other for a 1-to-1 match; (3) record linking based on deterministic and probabilistic algorithms; and (4) statistical analysis to evaluate the accuracy. Over 397M comparisons were made in 12 hours. The authors concluded that accuracy depends on the size of the data: the larger the data, the more accurate the record linking.
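The difference between deterministic and probabilistic linking is easier to see on a toy example. The sketch below is plain Python rather than Spark, and it is not the authors' workflow. The records, attribute names, and the 0.85 threshold are hypothetical, and the string similarity from the standard library's difflib is a stand-in for whatever comparison measure the paper uses. Anonymization is left out.

```python
import unicodedata
from difflib import SequenceMatcher

KEYS = ("name", "birth_date", "city")  # assumed common attributes


def normalize(text):
    """Pre-processing: strip accents, lowercase, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def deterministic_match(a, b):
    """Deterministic linking: every compared attribute must agree exactly."""
    return all(normalize(a[k]) == normalize(b[k]) for k in KEYS)


def similarity(a, b):
    """Probabilistic linking: average string similarity over the attributes."""
    ratios = [
        SequenceMatcher(None, normalize(a[k]), normalize(b[k])).ratio()
        for k in KEYS
    ]
    return sum(ratios) / len(ratios)


def link(left, right, threshold=0.85):
    """Return 1-to-1 links: exact matches first, then best score >= threshold."""
    links, used = {}, set()
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            if j not in used and deterministic_match(a, b):
                links[i] = j
                used.add(j)
                break
    for i, a in enumerate(left):
        if i in links:
            continue
        candidates = [(similarity(a, b), j) for j, b in enumerate(right) if j not in used]
        if candidates:
            score, j = max(candidates)
            if score >= threshold:
                links[i] = j
                used.add(j)
    return links


left = [
    {"name": "José da Silva", "birth_date": "1980-02-11", "city": "Salvador"},
    {"name": "Maria Souza", "birth_date": "1975-07-30", "city": "Recife"},
]
right = [
    {"name": "Maria Sousa", "birth_date": "1975-07-30", "city": "Recife"},
    {"name": "JOSE DA SILVA", "birth_date": "1980-02-11", "city": "salvador"},
    {"name": "Ana Lima", "birth_date": "1990-01-01", "city": "Natal"},
]
links = link(left, right)
print(links)
```

The first record links deterministically once accents and case are normalized. The second has a one-letter spelling difference in the name, so only the probabilistic pass can link it. Comparing every remaining pair is quadratic in the number of records, which is why a distributed engine such as Spark is attractive at the scale described above.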


  • Pereira, J. P., & Noronha, V. (2016). Anomalies Detection and Disease Prediction in Healthcare Systems using Big Data Analytics. Retrieved from
  • Pita, R., Pinto, C., Melo, P., Silva, M., Barreto, M., & Rasella, D. (2015). A Spark-based Workflow for Probabilistic Record Linkage of Healthcare Data. In EDBT/ICDT Workshops (pp. 17-26).