Big Data Analytics: POTUS Report

The aim of big data analytics is for data scientists to fuse data from various sources, of various types, and in huge amounts so that they can find relationships, identify patterns, and detect anomalies.  Big data analytics can provide a descriptive, prescriptive, or predictive answer to a specific research question.  Big data analytics isn't perfect: sometimes the results are not significant, and we must remember that correlation is not causation.  Regardless, big data analytics offers many benefits, and policy has yet to catch up with the field to protect the nation from potential downsides while still promoting and maximizing those benefits.

Policies for maximizing benefits while minimizing risk in the public and private sectors

In the private sector, companies can create detailed personal profiles that enable personalized services for consumers.  Interpreting personal profile data can help a company retain and command more market share, but it also leaves room for discrimination in pricing, service quality and type, and opportunities through "filter bubbles" (Podesta, Pritzker, Moniz, Holdren, & Zients, 2014).  Policy recommendations should encourage de-identifying personally identifiable information to the point that it cannot be re-identified. Current policies for promoting privacy in the private sector are (Podesta et al., 2014):

  • The Fair Credit Reporting Act promotes fairness and privacy of credit and insurance information.
  • The Health Insurance Portability and Accountability Act enables people to understand and control how their personal health data is used.
  • The Gramm-Leach-Bliley Act helps protect the privacy of consumers of financial services.
  • The Children's Online Privacy Protection Act minimizes the collection and use of data about children under the age of 13.
  • The Consumer Privacy Bill of Rights is a privacy blueprint that helps people understand what personal data is being collected about them and ensures it is used in ways consistent with their expectations.

In the public sector, issues arise when the government collects information about its citizens for one purpose and eventually uses that same data for a different purpose (Podesta et al., 2014).  This gives the government the potential to exert power over certain groups of citizens and hamper civil rights progress in the future.  Current policies in the public sector are (Podesta et al., 2014):

  • The Affordable Care Act allows for building a better health care system by moving from a "fee-for-service" model to a "fee-for-better-outcomes" model. This has enabled the use of big data analytics to promote preventative care rather than emergency care, while restricting the use of that data to deny health care coverage for "pre-existing health conditions."
  • The Family Educational Rights and Privacy Act, the Protection of Pupil Rights Amendment, and the Children's Online Privacy Protection Act help seal children's educational records to prevent misuse of that data.

Identifying opportunities for big data in the economy, health, education, safety, and energy efficiency

In the economy, the internet of things can equip product components with sensors that monitor and transmit thousands of data points live and send alerts.  These alerts can tell us when maintenance is needed, for which part, and where it is located, saving time across the entire process and improving overall safety (Podesta et al., 2014).

In medicine, predictive analytics can identify instances of insurance fraud, waste, and abuse in real time, saving more than $115M per year (Podesta et al., 2014).  Another use of big data is in neonatal intensive care, where current data can yield prescriptive results about which newborns are likely to contract which infections and what the outcomes would be (Podesta et al., 2014).  Monitoring newborns' heart rate, temperature, and other health indicators can alert doctors to the onset of an infection before it gets out of hand. Huge genetic data sets are helping locate genetic variants linked to certain diseases that were once hidden in our genetic code (Podesta et al., 2014).

With regard to national safety and foreign interests, data scientists and data visualizers have been using data gathered by the military to help commanders solve real operational challenges on the battlefield (Podesta et al., 2014).  Applying big data analytics to satellite data, surveillance data, and road traffic flow data is making it easier to detect, locate, and properly dispose of improvised explosive devices (IEDs).  The Department of Homeland Security aims to use big data analytics to identify threats as they enter the country and to flag people with a higher-than-normal probability of committing acts of violence within the country (Podesta et al., 2014). Another safety-related use of big data analytics is the identification of human trafficking networks through analysis of the "deep web" (Podesta et al., 2014).

Finally, for energy efficiency, understanding weather patterns and climate change can help us understand our contribution to climate change through our use of energy and natural resources. Analyzing traffic data can improve energy efficiency and public safety in existing lighting infrastructure by dimming lights at appropriate times (Podesta et al., 2014).  Companies can maximize energy efficiency by using big data analytics to control their direct and indirect energy use (for example, by optimizing supply chains and monitoring equipment).  The government is also partnering with electric utilities to give businesses and families access to their own energy usage in an easy-to-digest format, allowing people and companies to change their current consumption levels (Podesta et al., 2014).

Protecting your own privacy outside of policy recommendations

The report suggests that we can control our own privacy by using the private browsing function in most current internet browsers, which helps prevent the collection of personal data (Podesta et al., 2014), though private browsing varies from browser to browser.  For important decisions such as being denied employment, credit, or insurance, consumers should be empowered to know why they were denied and should ask for that information (Podesta et al., 2014).  Finding out the reason allows people to address those issues and persevere in the future.  We can also encrypt our communications, with the strongest encryption available, to protect our privacy.  We need to educate ourselves on how to protect our personal data, build digital literacy, and understand how big data can be used and abused (Podesta et al., 2014).  While we wait for current policies to catch up with the times, we actually have more power over our own data and privacy than we realize.

 

Reference:

Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big data: Seizing opportunities, preserving values. Executive Office of the President. Retrieved from https://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf

Big Data Analytics: Crime Fighting

Case Study: Miami-Dade Police Department: New patterns offer breakthroughs for cold cases. 

Introduction:

Tourism is key to South Florida, bringing in $20B per year to a county of 2.5M people.  Robbery and the rise of other street crimes can hurt tourism and a third of the state's sales tax revenue.  Thus, Lt. Arnold Palmer of the Miami-Dade Police Department's robbery investigation unit teamed up with IT Services Bureau staff and IBM specialists to develop Blue PALMS (Predictive Analytics Lead Modeling Software) to help fight crime and protect the citizens and tourists of Miami-Dade County. When tested on 40 solved cases, the tool achieved a 73% success rate. The tool was developed because most crimes are usually committed by the same people who committed previous crimes.

 Key Problems:

  1. Cold cases needed to be solved and finally closed. Relying on the old methods alone (mostly people skills and evidence gathering), patterns could still be missed, even by the most experienced officers.
  2. Other crimes, like robbery, happen in predictable patterns (times of day and locations), which is already explicit knowledge among the force. So a tool shouldn't just tell officers the location and time of the next crime; the police need to know who did it, so a narrowed-down list of likely suspects would help.
  3. The more experienced police officers are retiring, and their experience and knowledge leave with them. Thus, the tool must allow junior officers to ask the same questions of it and get the same answers they would have gotten by asking experienced officers.  Fortunately, newer officers tend to embrace technology whenever they can, whereas veteran officers tread lightly when it comes to embracing technology.

Key Components to Success:

It comes down to buy-in. Lt. Palmer had to nurture top-down support as well as buy-in from the bottom up (the ranks).  It was much harder to get buy-in from the more experienced detectives, who felt that introducing tools like analytics was a way of telling them to give up their long-standing practices, or even of replacing them.  So Lt. Palmer sold Blue PALMS this way: "What's worked best for us is proving [the value of Blue PALMS] one case at a time, and stressing that it's a tool, that it's a complement to their skills and experience, not a substitute."  Lt. Palmer won over a senior, well-respected officer by helping him solve a case.  The officer had a suspect in mind, and after the data was fed in, the tool produced a ranked list of the 20 most likely suspects.  The officer's suspect was in the top five, and when apprehended, the suspect confessed.  Working case by case like this built trust among veteran officers and eventually earned their buy-in.

Similar organizations that could benefit:

Other police departments in Florida counties with data collection practices similar to the Miami-Dade Police Department's would be a quick win (a short-term plan) for tool adoption.  Eventually, other police departments in Florida and in other states can adopt the tool after more successes have been demonstrated and shared by fellow police officers.  Police officers have a brotherhood mentality, and as acceptance of this tool grows it will reach critical mass, and adoption will come much more quickly than it does today.  Other organizations similar to police departments that could benefit from this tool are fire departments, other emergency responders, the FBI, and the CIA.

June 2020 Editorial Piece:

Please note that the accuracy of this crime-fighting model depends on the data coming in. Currently, the data being fed into these systems is skewed toward over-representing people of color and the Black community, even though crime rates are not dependent on race (Alexander, 2010; Kendi, 2019; Oluo, 2018). If the system that generated the input data is biased in this way, machine learning on that data will create a biased predictive model. Alexander (2010) and Kendi (2019) state that, historically, some police departments have prioritized and surveilled communities of color more than white communities. Thus, officers would accidentally find more crime in communities of color than in white communities (confirmation bias), which can then feed an unconscious bias in the police force about these communities (the halo and horns effect). Another point made in both Kendi (2019) and Alexander (2010) is that we may have laws on the books, but they are not applied equally across races; some laws and sentencing guidelines are harsher on people of color and the Black community. Therefore, we must rethink how we use these types of tools and what data is fed into them before using them as black-box predictive systems. Finally, I want to address the comment mentioned above: "The tool was developed because most crimes are usually committed by the same people who committed previous crimes." This speaks more to mass incarceration, private prisons, and school-to-prison-pipeline issues (Alexander, 2010). Addressing those issues should be a priority, so that we do not create racist algorithms, along with allowing returning citizens access to opportunities and fully restored citizenship rights so that "crime" can be reduced. However, these issues alone are out of the scope of this blog post.

 Resources:

Big Data Analytics: Open-Sourced Tools

Here are three open source text mining software tools for analyzing unstructured big data:

  1. Carrot2
  2. Weka
  3. Apache OpenNLP.

One of the great things about these three software tools is that they are free, so there is no licensing cost for any of them.

 Carrot2

Carrot2 is Java-based and also has native PHP integration and a C#/.NET API (Gonzalez-Aguilar & Ramirez Posada, 2012).  Carrot2 can organize a collection of documents into categories based on themes and present them visually; it can also be used as a web clustering engine. Carpineto, Osinski, Romano, and Weiss (2009) state that web clustering search engines like Carrot2 help with fast subtopic retrieval (e.g., searching for "tiger" can surface Tiger Woods, tigers, Bengals, the Bengals football team, etc.), topic exploration (through a cluster hierarchy), and alleviating information overlook (it goes beyond the first page of search results). The algorithms it uses for categorization are Lingo (Lingo3G), k-means, and STC, which support multilingual clustering, synonyms, etc. (Carrot, n.d.).  The software can also be used online in place of a regular search engine (Gonzalez-Aguilar & Ramirez Posada, 2012).  Gonzalez-Aguilar and Ramirez Posada (2012) explain that the interface processes information in three phases: entry, filtration, and exit.  It represents the clustered data in three visual formats: heatmap, network, and pie chart.

The disadvantage of this tool is that it only does clustering analysis; its advantage is that it can be layered onto a search engine to enable faster and more accurate searches through its subtopic analysis.  If you would like to try Carrot2 as a search engine, go to http://search.carrot2.org/stable/search.

Weka

Weka was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga, 2015).  It is a Java-based collection of machine learning algorithms for data mining and text mining tasks that can be used for predictive modeling, and it houses tools for preprocessing, classification, regression, clustering, association rules, and visualization (Weka, n.d.). Weka can be applied to big data (Weka, n.d.) and SQL databases (Patel & Donga, 2015).

A disadvantage of this tool is its lack of support for multi-relational data mining, but if you can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). The comprehensiveness of its analysis algorithms for both data and text mining, plus preprocessing, is its advantage.
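For R users, the RWeka package exposes many of Weka's algorithms from within R. The following is a minimal sketch, assuming RWeka and a Java runtime are installed and using R's built-in iris data (this example is mine, not one from Patel and Donga):

library(RWeka)                                 # R interface to Weka (requires Java)

model <- J48(Species ~ ., data = iris)         # Weka's C4.5/J48 decision tree on the iris data
summary(model)                                 # confusion matrix on the training data
evaluate_Weka_classifier(model, numFolds = 10) # 10-fold cross-validation estimate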

 Apache OpenNLP

Apache OpenNLP is a Java-based, conventional machine learning toolkit for natural language processing tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution (OpenNLP, n.d.). OpenNLP works well with the NetBeans and Eclipse IDEs, which helps the development process.  The tool has dependencies on Maven, UIMA Annotators, and SNAPSHOT builds.

The advantage of OpenNLP is that rules, constraints, and lexicons do not need to be specified manually; it is a machine learning method that aims to maximize entropy (Buyko, Wermter, Poprat, & Hahn, 2006).  Maximizing entropy allows facts to be collected consistently and uniformly.  When the sentence splitter, tokenizer, part-of-speech tagger, named entity extractor, chunker, parser, and coreference resolver were tested on two medical corpora, accuracy was in the high 90% range (Buyko et al., 2006).

High accuracy is this software's advantage, but it also produces quite a few false negatives, which is its disadvantage.  The sentence splitter picked up literature citations, and the tokenizer split on the special characters "-" and "/" (Buyko et al., 2006).
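There is also an openNLP package for R that wraps Apache OpenNLP's pre-trained models. The sketch below is a minimal illustration, assuming the openNLP and NLP packages (and the model data they install) are available; the sample sentence is made up:

library(NLP)
library(openNLP)                                # R wrapper around Apache OpenNLP models

txt <- as.String("The patient was admitted on Monday. Dr. Smith ordered a blood panel.")

sent_ann <- Maxent_Sent_Token_Annotator()       # sentence segmentation
word_ann <- Maxent_Word_Token_Annotator()       # tokenization
pos_ann  <- Maxent_POS_Tag_Annotator()          # part-of-speech tagging

ann <- annotate(txt, list(sent_ann, word_ann))  # find sentences, then tokens
ann <- annotate(txt, pos_ann, ann)              # add POS tags to the existing annotations
head(ann)                                       # inspect the resulting annotations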

 References:

  • Buyko, E., Wermter, J., Poprat, M., & Hahn, U. (2006). Automatically adapting an NLP core engine to the biology domain. In Proceedings of the Joint BioLINK-Bio-Ontologies Meeting: A Joint Meeting of the ISMB Special Interest Group on Bio-Ontologies and the BioLINK Special Interest Group on Text Data Mining in Association with ISMB (pp. 65-68).
  • Carpineto, C., Osinski, S., Romano, G., & Weiss, D. (2009). A survey of web clustering engines. ACM Computing Surveys, 41(3), Article 17. doi:10.1145/1541880.1541884
  • Carrot (n.d.). Open source framework for building search clustering engines. Retrieved from http://project.carrot2.org/index.html
  • Gonzalez-Aguilar, A., & Ramirez-Posada, M. (2012). Carrot2: Búsqueda y visualización de la información [Carrot2: Search and visualization of information] (in Spanish). El Profesional de la Informacion. Retrieved from http://project.carrot2.org/publications/gonzales-ramirez-2012.pdf
  • openNLP (n.d.). The Apache Software Foundation: OpenNLP. Retrieved from https://opennlp.apache.org/
  • Patel, K., & Donga, J. (2015). Practical approaches: A survey on data mining practical tools. Foundations, 2(9).
  • Weka (n.d.). Weka 3: Data mining software in Java. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/

Big Data Analytics: R

R is a powerful statistical tool that can aid in data mining, so it has huge relevance in the big data arena.  Focusing on my project, I have found that R has a text mining package, tm.

Patel and Donga (2015) and Fayyad, Piatetsky-Shapiro, and Smyth (1996) say that the main techniques in data mining are: anomaly detection (outlier/change/deviation detection), association rule learning (relationships between variables), clustering (grouping data that are similar to one another), classification (applying a known structure to new data), regression (finding a function to describe the data), and summarization (visualizations, reports, dashboards). According to Ghosh, Roy, and Bandyopadhyay (2012), the main types of text mining techniques are: text categorization (assigning text/documents to pre-defined categories), text clustering (grouping similar text/documents together), concept mining (discovering concept/logic-based ideas), information retrieval (finding the relevant documents for a query), and information extraction (identifying key phrases and relationships within the text). Meanwhile, Agrawal and Batra (2013) add summarization (a compressed representation of the input), assessing document similarity (similarities between different documents), and document retrieval (identifying and grabbing the most relevant documents) to the list of text mining techniques.

We use library(tm) to help transform text, stem words, build a term-document matrix, etc., mostly for preprocessing the data (RStudio pubs, n.d.). Based on RStudio pubs (n.d.), some text preprocessing steps and code are as follows (a combined sketch of the full pipeline appears after the list):

  • To remove punctuation:

docs <- tm_map(docs, removePunctuation)

  • To remove special characters:

for (j in seq(docs)) {
  docs[[j]] <- gsub("/", " ", docs[[j]])
  docs[[j]] <- gsub("@", " ", docs[[j]])
  docs[[j]] <- gsub("\\|", " ", docs[[j]])
}

  • To remove numbers:

docs <- tm_map(docs, removeNumbers)

  • Convert to lowercase:

docs <- tm_map(docs, content_transformer(tolower))

  • Removing “stopwords”/common words

docs <- tm_map(docs, removeWords, stopwords("english"))

  • Removing particular words

docs <- tm_map(docs, removeWords, c("department", "email"))

  • Combining words that should stay together

for (j in seq(docs)) {
  docs[[j]] <- gsub("qualitative research", "QDA", docs[[j]])
  docs[[j]] <- gsub("qualitative studies", "QDA", docs[[j]])
  docs[[j]] <- gsub("qualitative analysis", "QDA", docs[[j]])
  docs[[j]] <- gsub("research methods", "research_methods", docs[[j]])
}

  • Removing common word endings (stemming)

library(SnowballC)
docs <- tm_map(docs, stemDocument)
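Putting these steps together, a minimal sketch of the whole preprocessing pipeline might look like the following (the "texts/" folder path is a placeholder, and the exact steps you keep will depend on your corpus):

library(tm)
library(SnowballC)

docs <- VCorpus(DirSource("texts/"))                      # load plain-text files into a corpus
docs <- tm_map(docs, removePunctuation)                   # strip punctuation
docs <- tm_map(docs, removeNumbers)                       # strip numbers
docs <- tm_map(docs, content_transformer(tolower))        # convert to lowercase
docs <- tm_map(docs, removeWords, stopwords("english"))   # drop common stopwords
docs <- tm_map(docs, stemDocument)                        # reduce words to their stems
docs <- tm_map(docs, stripWhitespace)                     # collapse extra whitespace

tdm <- TermDocumentMatrix(docs)                           # term-document matrix for analysis
inspect(tdm[1:5, 1:5])                                    # peek at a corner of the matrix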

Text mining algorithms include, but are not limited to, the following (Zhao, 2013); a short sketch illustrating a few of them appears after the list:

  • Summarization:
    • Word clouds: library(wordcloud)
    • Word frequencies
  • Regressions:
    • Term correlations: findAssocs() (plotted with library(ggplot2))
    • Plots of word frequencies and term correlations: library(ggplot2)
  • Classification models:
    • Decision trees: library(party) or library(rpart)
  • Association models:
    • Apriori: library(arules)
  • Clustering models:
    • K-means clustering: library(fpc)
    • K-medoids clustering: library(fpc)
    • Hierarchical clustering: library(cluster)
    • Density-based clustering: library(fpc)
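As a rough illustration of a few of these, the sketch below computes word frequencies, draws a word cloud, finds term correlations, and runs a simple k-means clustering. It assumes the tdm and docs objects from the preprocessing sketch above, and the term passed to findAssocs() is a placeholder:

library(wordcloud)

m <- as.matrix(tdm)                              # term-document matrix from preprocessing
freq <- sort(rowSums(m), decreasing = TRUE)      # word frequencies (summarization)
head(freq, 10)                                   # ten most frequent terms

set.seed(42)
wordcloud(names(freq), freq, max.words = 50)     # word cloud of the most frequent terms

findAssocs(tdm, "research", corlimit = 0.6)      # terms correlated with "research" (placeholder)

dtm <- DocumentTermMatrix(docs)                  # documents as rows for clustering
km <- kmeans(as.matrix(dtm), centers = 3)        # simple k-means clustering of the documents
km$cluster                                       # cluster assignment for each document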

As we can see, there are current libraries, functions, etc. to help with data preprocessing, data mining, and data visualization when it comes to text mining with R and RStudio.

Resources:

Big Data Analytics: Installing R

I didn't have any problems with the installation, thanks to a video produced by Dr. Webb (2014).  It is a bigger package than I thought it would be, so it can take a few minutes to download, depending on your download speed and internet connection. Thus:

(1)    For proper installation of R, you need to have administrative access on your computer.

(2)    Watch this video for step-by-step instructions and an online tutorial on installing R and its graphical Integrated Development Environment (IDE).

  1. Note: The application for R 32x and 64x can be found at http://cran.r-project.org/
  2. Note: The free RStudio Desktop graphical IDE can be found at http://www.rstudio.com/

(3)    Once installed, use the manual for this application at this site: http://cran.r-project.org/doc/manuals/R-intro.html

Once I installed the software and the graphical IDE, I continued to follow along with the video, used the preloaded cars data in the "datasets" package, and got the same result as shown in the video.  I would also like to note that Dr. Webb (2014) had checked the packages "datasets," "graphics," "grDevices," "methods," and "stats" in the video, which can be hard to see depending on your video streaming resolution.
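As a quick post-installation check, something along these lines reproduces the kind of result shown with the built-in cars data (the exact commands used in the video may differ slightly):

library(datasets)     # built-in example data sets
data(cars)            # speed and stopping distance of cars
summary(cars)         # basic descriptive statistics
plot(cars)            # scatterplot of speed versus stopping distance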

Resources:

Webb, J. (2014). Installing and Using the “R” Programming Language and RStudio. Retrieved from https://www.youtube.com/watch?v=77PgrZSHvws&feature=youtu.be

Big Data Analytics: Hadoop®

Hadoop® Distributed File System (HDFS):

In HDFS, big data is broken up into smaller blocks (IBM, n.d.), which can be aggregated like a set of Legos across a distributed file system. Data blocks are distributed across multiple servers.  This block system provides an easy way to scale a company's data needs up or down, and it allows MapReduce to do its tasks on smaller subsets of the data for faster processing (IBM, n.d.). Blocks are small enough that they can easily be duplicated (for disaster recovery purposes) on two different servers (or more, depending on your data needs).

Example 1:

For an example of how HDFS stores data, think of a deck of cards, where each card holds information about itself: value, color, symbol, etc.  HDFS can divide the data into blocks by value, A, 2, 3 … J, Q, & K, so each block holds the data for about four cards.  Thus, there are 13 distinct data blocks, parsed by value and placed on 13 different servers.  Let's also assume I need higher-than-average availability, so rather than two copies, I need four copies of the J, Q, & K blocks and two of the A, 2 … 10 blocks.  This is possible.  The copies can be clustered on similar servers, or each can sit on a server of its own.  This kind of redundancy within HDFS gives my data higher availability.  Thus, when I need to analyze my deck of cards, the important values J, Q, & K have a higher chance of being available than the perceived lower-value cards A, 2 … 10.
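To make the card example concrete, here is a toy sketch in R (not actual HDFS code; the server names and random placement are purely illustrative) that splits the deck into 13 value blocks and assigns more replicas to the high-value blocks:

values <- c("A", 2:10, "J", "Q", "K")                        # 13 blocks, one per card value

set.seed(1)
placement <- do.call(rbind, lapply(values, function(v) {
  n <- if (v %in% c("J", "Q", "K")) 4 else 2                 # J, Q, K get four copies; the rest two
  data.frame(block = v, copy = seq_len(n),
             server = sample(paste0("server-", 1:13), n))    # each copy lands on a different server
}))
head(placement)                                              # which copy of which block lives where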

MapReduce:

MapReduce consists of two job types that work in parallel on distributed systems: (1) mappers, which create and process transactions on the system by mapping/aggregating data by key values, and (2) reducers, which know what those key values are and take all the values stored in a map and reduce the data to what is relevant (Hortonworks, 2013; Sathupadi, 2010). Reducers can work on different keys.  Huge amounts of data are fed into MapReduce; the mappers map the data, then the data is shuffled and sorted before it is reduced.  Once the data is reduced, we get the output we sought.

IBM (n.d.) notes that MapReduce jobs using HDFS run their procedures on the servers where the data is stored (also known as data locality).  Keeping in mind that HDFS has at least two backup copies, if one server goes down, which can happen, the job can continue on the same data on a different, working server.  This backup system for disaster recovery allows for high data availability.

Example 2:

Adapted from Sathupadi (2010), consider how MapReduce can calculate the sum of all Harvard law and medical students' current outstanding school loans per degree type.  The final output from our example would be the current outstanding school loan amount for Juris Doctor (JD) students, for Master of Laws (LLM) students, for Doctor of Medicine (MD) students, and for Doctor of Osteopathic Medicine (DO) students.

If I ran this in Hadoop with a single copy of the data spread across 50 servers, 50 nodes could process this request in parallel, speeding it up significantly, but not by 50-fold.  The reason it is not 50-fold is that it takes time to reduce the mapped output, and nodes need to talk to each other, which slows down the transaction.  So, running on X nodes in parallel never really means we are X times faster; in reality, we are roughly X-e times faster (where e is the transaction cost).

The data filtered out in the mapper phase would be the undergraduate students, Doctor of Philosophy students, master's degree students, etc.  Only JD, LLM, MD, and DO students get a key assigned to them, keys that are consistent across all nodes, so that the sums of current outstanding school loan amounts are processed under the correct groups.  Because the data is duplicated at least twice on different servers, if a server were to go down, the MapReduce job would move on to a copy of that data, which can still be mapped and reduced.
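A toy R sketch of that map-shuffle-reduce flow (the student records and loan amounts below are made up purely for illustration):

students <- data.frame(                                       # made-up input records
  degree = c("JD", "MD", "PhD", "LLM", "DO", "MD", "BA", "JD"),
  loan   = c(120000, 200000, 60000, 90000, 180000, 150000, 30000, 110000))

# Map: filter out degrees we do not care about and emit (key = degree, value = loan) pairs
pairs <- students[students$degree %in% c("JD", "LLM", "MD", "DO"), ]

# Shuffle/sort: group the emitted values by key
grouped <- split(pairs$loan, pairs$degree)

# Reduce: sum the outstanding loans for each key
totals <- sapply(grouped, sum)
totals                                                        # total outstanding loans per degree type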

 Resources:

 

Big Data Analytics: Cloud Computing

Clouds come in three different privacy flavors: public (all customers and companies share the same resources), private (only one group of clients or one company can use particular cloud resources), and hybrid (some aspects of the cloud are public while others are private, depending on data sensitivity).

Cloud technology encompasses Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).  These types of cloud differ in what the company manages versus what the cloud provider manages.  For IaaS, the company manages the applications, data, runtime, and middleware, whereas the provider administers the O/S, virtualization, servers, storage, and networking.  For PaaS, the company manages the applications and data, whereas the vendor administers the runtime, middleware, O/S, virtualization, servers, storage, and networking.  Finally, for SaaS, the provider manages it all: applications, data, runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2011).  This differs from the conventional data center, where the company manages everything itself.
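One way to summarize that split is as a small lookup table; this sketch simply encodes the breakdown described above (per Lau, 2011) in R:

layers <- c("applications", "data", "runtime", "middleware",
            "O/S", "virtualization", "servers", "storage", "networking")

responsibility <- data.frame(
  layer = layers,
  IaaS  = ifelse(layers %in% c("applications", "data", "runtime", "middleware"),
                 "company", "provider"),
  PaaS  = ifelse(layers %in% c("applications", "data"), "company", "provider"),
  SaaS  = "provider")                          # the SaaS provider manages every layer
responsibility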

Examples of IaaS are Amazon Web Services, Rack Space, and VMware vCloud.  Examples of PaaS are Google App Engine, Windows Azure Platform, and force.com. Examples of SaaS are Gmail, Office 365, and Google Docs (Lau, 2011).

There are benefits to the cloud's pay-as-you-go business model.  One, the company can pay for as much (SaaS) or as little (IaaS) of the service as it needs, and for only as much space as it requires. Two, the company can use an on-demand model, in which businesses scale up and down as needed (Dikaiakos, Katsaros, Mehra, Pallis, & Vakali, 2009).  For example, if a company would like a development environment for three weeks, it can build it in the cloud for that period and pay for three weeks of service rather than buying new infrastructure and setting up all the libraries.  Choosing the cloud over buying new infrastructure can speed up development for a ton of applications going forward.  These models are like renting a car: you rent what you need and pay for what you use (Lau, 2011).

Replacing Conventional Data Center?

Infrastructure costs are really high.  For a company to spend that much money on something that will be outdated in 18 months (Moore's law) is a constant money sink.  Outsourcing infrastructure is the first step of a company's move into the cloud.  However, companies need to understand the different privacy flavors well: if data is stored in a public cloud, it will be hard to destroy the hardware, because doing so would destroy not only your data but other people's and companies' data as well.  Private clouds are best for government agencies that may need or require physical destruction of the hardware.  Government agencies may even use hybrid structures, keeping private data in private clouds and public data in a public cloud.  Companies that contract with the government could migrate to hybrid clouds in the future, and businesses without government contracts could go onto a public cloud.  There may always be a need to store some data on a private server, like patents or KFC's 7 herbs and spices recipe, but for the majority of data, the cloud may be a grand place to store and work from.

Note: Companies that do venture into a cloud platform for storing data should focus on migrating data and data dictionaries slowly and with uniformity.  Data variables should have the same naming convention, one definition, a list of who is responsible for the data, meta-data, etc.  Migration to a new infrastructure is a great chance for companies to clean up their data.

Resources:

 

Big Data Analytics: Advertising

Advertising went from focusing on sales, to a consumer focus, to social media advertising, to now trying to establish a relationship with consumers.  In the late 1990s and early 2000s, third-party cookies were placed on consumers' machines to deliver information back to companies, and based on the priority level of those cookies, banner ads would appear on other websites selling targeted products (sometimes unrelated to the current search).  Sometimes you don't even have to click on the banner for the cookies to be stored (McNurlin, Sprague, & Bui, 2008).  McNurlin et al. (2008) then describe how consumer shopping data came to be collected through loyalty cards at companies such as Blockbuster, Publix, and Winn-Dixie.

Even before all of this, from the 1980s onward, company credit cards like the Sears MasterCard could capture all this data, even though they also collected loads of other data that may not have helped with selling or advertising a particular product mix.  The companies would influence buyers by offering store discounts when the card was used in their locations, driving more consumption.  They could then target ads, flyers, and sales based on the data gathered through each swipe of the card.

Now, in today's world, we see online profiling coming into existence.  Online profiling uses a person's online identity to collect information about them, their behaviors, their interactions, their tastes, etc., to drive targeted advertising (McNurlin et al., 2008).  Online profiling straddles the line between useful, annoying, and "Big Brother is watching" (Pophal, 2014).  Profiling began with third-party cookies and has evolved with the times to include 40 different variables that can be sent from a consumer's mobile device while they shop (Pophal, 2014).  Online profiling now allows marketers to send personalized, "perfect" advertisements to the consumer instantly.  However, as consumers switch from device to device, marketers must find the best way to continue the buying experience without becoming too annoying, which can turn the consumer away from using the app and even from buying the product (Pophal, 2014).  The best way to describe this is through this quote from a modern marketer in Pophal (2014): "So if I'm in L.A., and it's a pretty warm day here-85 degrees-you shouldn't be showing me an ad for hot coffee; you should be showing me a cool drink." Marketers now aim to build relationships with consumers by trying to provide perceived value using these types of techniques.

Amazon tries a different approach: as items are added to the shopping cart and before purchase, it uses aggregated big data to find out what other items the consumer might purchase (Pophal, 2014) and says, "Others who purchased X also bought Y, Z, and A."  This almost implies that the items are a set that will enhance your overall experience, so buy some more.

Resources:

 

Big Data Analytics: Privacy & HIPAA

Since its inception 25 years ago, the Human Genome Project has sequenced the roughly 3B base pairs of the human genome (Green, Watson, & Collins, 2015).  The project gave rise to a new program, the Ethical, Legal and Social Implications (ELSI) project.  ELSI received 5% of the project's National Institutes of Health budget to study the ethical implications of this data, opening up a new field of study (Green et al., 2015; O'Driscoll, Daugelaite, & Sleator, 2013).  Data sharing must occur to leverage the benefits of the genome project and others like it.  Poldrack and Gorgolewski (2014) state that sharing data helps advance the field in several ways: maximizing the contribution of research subjects, enabling responses to new questions, enabling the generation of new questions, enhancing the reproducibility of research results (especially when the data and software used are shared together), providing a test bed for new big data analysis methods, improving research practices (development of a standard of ethics), reducing the cost of doing the science (beyond what is feasible for one scientist), and protecting valuable scientific resources (by indirectly creating a redundant backup for disaster recovery).  Allowing genomic data to be shared can present ethical challenges, yet it lets multiple countries and disciplines come together to analyze data sets and come up with new insights (Green et al., 2015).

Richards and King (2014) state that, concerning privacy, we must think in terms of the flow of personal information.  Privacy cannot be thought of as binary, with data either private or public, but as a spectrum.  Richards and King (2014) argue that data exchanged between two people carries a certain expectation of privacy and can remain confidential, but data is never absolutely private or absolutely public.  Not everyone in the world will know or care about every single data point, nor will any data point stay permanently secret once it is uttered out loud.  Thus, Richards and King (2014) state that transparency can help prevent abuse of the data flow.  That is why McEwen, Boyer, and Sun (2013) discuss the options of open consent (your data can be used for any future research project), broad consent (various ways the data could be used are described, but it is not universal), and opt-out consent (participants can say what their data should not be used for).

Attempts are being made through the Genetic Information Nondiscrimination Act (GINA) to protect identifying data, out of fear that it could be used to discriminate against a person with a certain genomic indicator (McEwen et al., 2013).  Institutional Review Boards and the Common Rule, with the Office for Human Research Protections (OHRP), provide guidance on the flow of de-identified information.  De-identified information can be shared and is valid under current Health Insurance Portability and Accountability Act of 1996 (HIPAA) rules (McEwen et al., 2013).  However, fear of losing control over data flow comes from advances in decryption and de-anonymization techniques (O'Driscoll et al., 2013; McEwen et al., 2013).

Data must be seen and recognized as part of a person's identity, which can be defined as the "ability of individuals to define who they are" (Richards & King, 2014). Thus, the assertion made in O'Driscoll et al. (2013) about the difficulty of protecting medical data, given big data and the changing conceptual, definitional, and legal landscape of privacy, is valid.  Because of HIPAA, cloud computing is currently on a watch list. Cloud computing can provide a lot of opportunity for cost savings; however, Amazon's cloud offering is not HIPAA compliant, hybrid clouds could become HIPAA compliant, and commercial cloud options like GenomeQuest and DNANexus are HIPAA compliant (O'Driscoll et al., 2013).

However, ethical issues extend beyond privacy and compliance.  McEwen et al. (2013) warn that data has been collected for 25 years; what if data from 20 years ago shows that a participant may suffer an adverse, but preventable, health condition?  What is the duty of today's researchers to that participant?  How far back in years should that duty go?

Other ethical issues to consider: when it comes to data sharing, how should researchers who collected the data but didn't analyze it be positively incentivized?  One way is to make them co-authors of any publication revolving around their data, but that is incompatible with current standards of authorship (Poldrack & Gorgolewski, 2014).

 

Resources:

  • Green, E. D., Watson, J. D., & Collins, F. S. (2015). Twenty-five years of big biology. Nature, 526.
  • McEwen, J. E., Boyer, J. T., & Sun, K. Y. (2013). Evolving approaches to the ethical management of genomic data. Trends in Genetics, 29(6), 375-382.
  • O'Driscoll, A., Daugelaite, J., & Sleator, R. D. (2013). 'Big data,' Hadoop and cloud computing in genomics. Journal of Biomedical Informatics, 46(5), 774-781.
  • Poldrack, R. A., & Gorgolewski, K. J. (2014). Making big data open: Data sharing in neuroimaging. Nature Neuroscience, 17(11), 1510-1517.
  • Richards, N. M., & King, J. H. (2014). Big data ethics. Wake Forest Law Review, 49, 393.