Big Data Analytics: POTUS Report

We live in a data-centric society that relies on real-time data and technology (e.g., cell phones, online shopping, social networking) more than ever. Although this data offers many advantages, there are concerns that collecting massive amounts of it can lead to invasions of privacy. In January 2014, President Obama asked his staff to take 90 days to prepare a report on how big data is affecting people’s privacy. This post revolves around that report.

The aim of big data analytics is for data scientists to fuse data from various sources, of various types, and in huge amounts in order to find relationships, identify patterns, and detect anomalies.  Big data analytics can provide a descriptive, prescriptive, or predictive answer to a specific research question.  It isn’t perfect: sometimes the results are not significant, and we must remember that correlation is not causation.  Regardless, big data analytics offers many benefits, and policy has yet to catch up with the field in order to protect the nation from potential downsides while still promoting and maximizing those benefits.

Policies for maximizing benefits while minimizing risk in the public and private sectors

In the private sector, companies can create detailed personal profiles that enable personalized services from a company to a consumer.  Interpreting personal profile data allows a company to retain and command more market share, but it can also leave room for discrimination in pricing, service quality/type, and opportunities through “filter bubbles” (Podesta, Pritzker, Moniz, Holdren, & Zients, 2014).  Policy recommendations should encourage de-identifying personally identifiable information to the point that the data cannot be re-identified. Current policies that promote privacy in the private sector are (Podesta et al., 2014):

  • The Fair Credit Reporting Act helps to promote fairness and privacy of credit and insurance information
  • The Health Insurance Portability and Accountability Act enables people to understand and control how their personal health data is used
  • The Gramm-Leach-Bliley Act helps consumers of financial services keep their information private
  • The Children’s Online Privacy Protection Act minimizes the collection and use of data from children under the age of 13
  • The Consumer Privacy Bill of Rights is a privacy blueprint that helps people understand what personal data is being collected about them and that it is used in ways consistent with their expectations

In the public sector, issues arise when the government collects information about its citizens for one purpose and eventually uses that same data for a different purpose (Podesta et al., 2014).  This gives the government the potential to exert power over certain types of citizens and to hamper civil rights progress in the future.  Current policies in the public sector are (Podesta et al., 2014):

  • The Affordable Care Act allows the health care system to move from a “fee-for-service” program to a “fee-for-better-outcomes” program. This has allowed big data analytics to be used to promote preventative care rather than emergency care, while reducing the use of that data to eliminate health care coverage for “pre-existing health conditions.”
  • The Family Educational Rights and Privacy Act, the Protection of Pupil Rights Amendment, and the Children’s Online Privacy Protection Act help seal children’s educational records to prevent misuse of that data.

Identifying opportunities for big data in the economy, health, education, safety, energy-efficiency

In the economy, the internet of things can equip parts of a product with sensors that monitor and transmit thousands of live data points for sending alerts.  These alerts can tell us when maintenance is needed, for which part, and where that part is located, which saves time and improves overall safety (Podesta et al., 2014).

In medicine, predictive analytics can be used to identify instances of insurance fraud, waste, and abuse in real time, saving more than $115M per year (Podesta et al., 2014).  Another use of big data is in neonatal intensive care, where current data helps create prescriptive results that determine which newborns are likely to contract which infection and what the outcome would be (Podesta et al., 2014).  Monitoring a newborn’s heart rate and temperature along with other health indicators can alert doctors to the onset of an infection and prevent it from getting out of hand. Huge genetic data sets are helping to locate the genetic variants behind certain genetic diseases that were once hidden in our genetic code (Podesta et al., 2014).

With regard to national safety and foreign interests, data scientists and data visualizers have been using data gathered by the military to help commanders solve real operational challenges on the battlefield (Podesta et al., 2014).  Big data analytics on satellite data, surveillance data, and road traffic flow data is making it easier to detect, obtain, and properly dispose of improvised explosive devices (IEDs).  The Department of Homeland Security is aiming to use big data analytics to identify threats as they enter the country and people with a higher than normal probability of committing acts of violence within the country (Podesta et al., 2014). Another safety-related use of big data analytics is the identification of human trafficking networks by analyzing the “deep web” (Podesta et al., 2014).

Finally, for energy efficiency, understanding weather patterns and climate change can help us understand our contribution to climate change based on our use of energy and natural resources. By analyzing traffic data, we can improve the energy efficiency and public safety of our current lighting infrastructure by dimming lights at appropriate times (Podesta et al., 2014).  Companies can maximize energy efficiency by using big data analytics to control their direct and indirect energy use (through optimizing supply chains and monitoring equipment).  We are also moving toward a more energy-efficient future as the government partners with electric utility companies to give businesses and families access to their personal energy usage in an easy-to-digest manner, so that people and companies can change their current consumption levels (Podesta et al., 2014).

Protecting your own privacy outside of policy recommendation

The report suggests that we can control our own privacy by using the private browsing function found in most current internet browsers, which helps prevent the collection of personal data (Podesta et al., 2014). However, private browsing varies from browser to browser.  For important decisions, such as being denied employment, credit, or insurance, consumers should be empowered to know why they were denied and should ask for that information (Podesta et al., 2014).  Finding out the reason allows people to address those issues and persevere in the future.  We can also encrypt our communications with the strongest encryption available to protect our privacy.  We need to educate ourselves on how to protect our personal data, improve our digital literacy, and know how big data can be used and abused (Podesta et al., 2014).  While we wait for current policies to catch up with the times, we actually have more power over our own data and privacy than we know.

 

Reference:

Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J. & Zients,  J. (2014). Big Data: Seizing Opportunities, Preserving Values.  Executive Office of the President. Retrieved from https://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf

Big Data Analytics: Crime Fighting

Big data analytics can have a profound effect on the success of a business, and several case studies document this success. This post introduces and examines one such case study, explains the key problems that needed to be resolved, and identifies the key components that led to the case’s success.

Case Study: Miami-Dade Police Department: New patterns offer breakthroughs for cold cases. 

Introduction:

Tourism is key to South Florida, bringing in $20B per year in a county of 2.5M people.  Robbery and the rise of other street crimes can hurt tourism and a third of the state’s sales tax revenue.  Thus, Lt. Arnold Palmer from the Robbery Investigation Police Department of Miami-Dade County teamed up with IT Services Bureau staff and IBM specialists to develop Blue PALMS (Predictive Analytics Lead Modeling Software) to help fight crime and protect the citizens of and tourists to Miami-Dade County. In testing, the tool achieved a 73% success rate on 40 solved cases. The tool was developed because most crimes are committed by people who have committed previous crimes.

 Key Problems:

  1. Cold cases needed to be solved and finally closed. With the old methods alone (mostly people skills and evidence gathering), patterns could still be missed by even the most experienced officers.
  2. Other crimes, like robbery, happen in predictable patterns (times of day and locations), which is explicit knowledge amongst the force. So, a tool shouldn’t tell them the location and time of the next crime; the police need to know who did it, so a narrowed-down list of likely suspects would help.
  3. The more experienced police officers are retiring, and their experience and knowledge leave with them. Thus, the tool must allow junior officers to ask it the same questions they would ask experienced officers and get the same answers.  Fortunately, the opportunity here is that newer officers embrace technology whenever they can, whereas veteran officers tread lightly when it comes to embracing technology.

Key Components to Success:

It comes down to buy-in. Lt. Palmer had to nurture top-down support as well as buy-in from the bottom up (the ranks).  It was much harder to get buy-in from the more experienced detectives, who felt that the introduction of tools like analytics was a way of telling them to give up their long-standing practices, or even of replacing them.  As Lt. Palmer put it, “What’s worked best for us is proving [the value of Blue PALMS] one case at a time, and stressing that it’s a tool, that it’s a complement to their skills and experience, not a substitute.”  Lt. Palmer got buy-in from a senior and well-respected officer by helping him solve a case.  The senior officer had a suspect in mind, and after the data was fed in, the tool produced a list of 20 people who could have done it, ordered from most to least likely.  The suspect was in the top five and, when apprehended, confessed.  Doing this case by case built trust amongst veteran officers and eventually won their buy-in.

 Similar organizations could benefit:

Other county police departments in Florida that have data collection measures similar to those of the Miami-Dade County Police Department would be a quick win (a short-term plan) for tool adoption.  Eventually, other police departments in Florida and in other states can start adopting the tool, after more successes have been documented and shared by fellow police officers.  Police officers have a brotherhood mentality, and as acceptance of this tool grows, it will eventually reach critical mass and adoption will come much more quickly than it does today.  Other organizations similar to police departments that could benefit from this tool are firefighters, other emergency responders, the FBI, and the CIA.

 Resources:

Big Data Analytics: Open-Sourced Tools

It is critical that big data analysts are provided with software tools to effectively synthesize large data sets. Various software tools for analyzing big data have grown in popularity over the past few years. This post compares and contrasts three software tools that I found most effective in analyzing big data, including the advantages and disadvantages of each application.

Here are three open source text mining software tools for analyzing unstructured big data:

  1. Carrot2
  2. Weka
  3. Apache OpenNLP.

One of the great things about these three software tools is that they are free, so none of them carries a licensing cost.

 Carrot2

Carrot2 is a Java-based tool that also offers native integration with PHP and a C#/.NET API (Gonzalez-Aguilar & Ramirez Posada, 2012).  Carrot2 can visually organize a collection of documents into categories based on themes; it can also be used as a web clustering engine. Carpineto, Osinski, Romano, and Weiss (2009) stated that web clustering search engines like Carrot2 help with fast subtopic retrieval (e.g., a search for “tiger” can return Tiger Woods, tigers, Bengals, the Bengals football team, etc.), topic exploration (through a cluster hierarchy), and alleviating information overlook (looking beyond the first page of search results). The algorithms it uses for categorization are Lingo (Lingo3G), k-means, and STC, which can support multiple-language clustering, synonyms, etc. (Carrot, n.d.).  This software can also be used online in place of regular search engines (Gonzalez-Aguilar & Ramirez Posada, 2012).  Gonzalez-Aguilar and Ramirez Posada (2012) explain that the interface has three phases for processing information: entry, filtration, and exit.  It represents the cluster data in three visual formats: heatmap, network, and pie chart.

The disadvantage of this tool is that it only does clustering analysis, but its advantage is that it can be applied to a search engine to facilitate faster and more accurate searches through its subtopic analysis.  If you would like to use Carrot2 as a search engine, go to http://search.carrot2.org/stable/search and try it out.
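To make the idea of clustering search results into subtopics more concrete, here is a minimal sketch in base R. It is not Carrot2’s Lingo or STC algorithm: it swaps in plain hierarchical clustering on word counts, and the four “tiger” snippets are made up for illustration.

```r
# Toy "search results" for the query "tiger" (invented for this sketch).
snippets <- c(
  "tiger woods wins golf tournament",
  "golf tournament results tiger woods",
  "bengal tiger habitat in india",
  "india protects bengal tiger habitat"
)

# Tokenize on spaces and build a document-term count matrix by hand.
tokens <- strsplit(snippets, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Hierarchical clustering on Euclidean distances between the count vectors
# (an assumption for this sketch, not what Carrot2 does internally).
hc <- hclust(dist(dtm), method = "average")
clusters <- cutree(hc, k = 2)

# Label each cluster with two of its most frequent terms, ignoring the query word.
top_terms <- tapply(seq_along(snippets), clusters, function(idx) {
  counts <- colSums(dtm[idx, , drop = FALSE])
  counts <- counts[names(counts) != "tiger"]
  paste(names(sort(counts, decreasing = TRUE))[1:2], collapse = ", ")
})

print(clusters)
print(top_terms)
```

The two golf snippets land in one group and the two wildlife snippets in the other, which is the kind of subtopic split a web clustering engine presents to the user.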

Weka

Weka was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga, 2015).  It is a Java-based collection of machine learning algorithms for data mining and text mining tasks that can be used for predictive modeling, and it includes tools for pre-processing, classification, regression, clustering, association rules, and visualization (Weka, n.d.). Weka can be applied to big data (Weka, n.d.) and SQL databases (Patel & Donga, 2015).

A disadvantage of this tool is its lack of support for multi-relational data mining, but if you can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). Its advantage is the comprehensiveness of its analysis algorithms for both data and text mining and pre-processing.

 Apache OpenNLP

OpenNLP is a Java-based machine learning toolkit that supports tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution (OpenNLP, n.d.).  OpenNLP works well with the NetBeans and Eclipse IDEs, which helps in the development process.  This tool has dependencies on Maven, UIMA Annotators, and SNAPSHOT.

The advantage of OpenNLP is that rules, constraints, and lexicons don’t need to be specified manually, because it relies on a machine learning method that aims to maximize entropy (Buyko, Wermter, Poprat, & Hahn, 2006).  Maximizing entropy allows facts to be collected consistently and uniformly.  When the sentence splitter, tokenization, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution were tested on two medical corpora, accuracy was in the high 90% range (Buyko et al., 2006).

This software has high accuracy as its advantage, but it also produces quite a few false negatives, which is its disadvantage.   For example, the sentence splitter picked up literature citations, and the tokenizer picked up special characters such as “-” and “/” (Buyko et al., 2006).
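To see why citations and characters like “-” and “/” are hard cases for sentence splitting and tokenization, here is a small base R sketch. It uses simple hand-written rules, not OpenNLP’s trained models and not the method of Buyko et al. (2006); the example sentence and both tokenizing rules are assumptions made for illustration.

```r
# Invented biomedical-style text with an inline citation.
text <- "IL-2 binds the IL-2R alpha/beta complex (Smith et al. 2001). It activates T-cells."

# Naive sentence splitter: break on whitespace that follows ".", "!" or "?".
sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))

# Naive tokenizer: every non-alphanumeric character is a separator,
# so "IL-2" and "alpha/beta" are broken apart.
naive_tokens <- unlist(strsplit(text, "[^[:alnum:]]+"))

# Alternative rule: keep "-" and "/" when they sit between alphanumerics.
keep_pattern <- "[[:alnum:]]+(?:[-/][[:alnum:]]+)*"
kept_tokens <- unlist(regmatches(text, gregexpr(keep_pattern, text, perl = TRUE)))

print(sentences)     # the period in "et al." triggers an extra split
print(naive_tokens)
print(kept_tokens)
```

The text contains two sentences, but the naive splitter returns three pieces because of the abbreviation inside the citation. Likewise, whether “IL-2” is one token or two depends entirely on the rule chosen, which is the kind of decision a trained model has to learn from domain-specific data.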

 References:

  • Buyko, E., Wermter, J., Poprat, M., & Hahn, U. (2006). Automatically adapting an NLP core engine to the biology domain. In Proceedings of the Joint BioLINK-Bio-Ontologies Meeting. A Joint Meeting of the ISMB Special Interest Group on Bio-Ontologies and the BioLINK Special Interest Group on Text Data Mining in Association with ISMB (pp. 65-68).
  • Carpineto, C., Osinski, S., Romano, G., & Weiss, D. (2009). A survey of web clustering engines. ACM Comput. Surv. 41, 3, Article 17 (July 2009), 38 pages. DOI = 10.1145/1541880.1541884 http://doi.acm.org/10.1145/1541880.1541884
  • Carrot (n.d.) Open source framework for building search clustering engines. Retrieved from http://project.carrot2.org/index.html
  • Gonzalez-Aguilar, A., & Ramirez-Posada, M. (2012). Carrot2: Búsqueda y visualización de la información (in Spanish). El Profesional de la Informacion. Retrieved from http://project.carrot2.org/publications/gonzales-ramirez-2012.pdf
  • openNLP (n.d.) The Apache Software Foundation: OpenNLP. Retrieved from https://opennlp.apache.org/
  • Weka (n.d.) Weka 3: Data Mining Software in Java. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/
  • Patel, K., & Donga, J. (2015). Practical Approaches: A Survey on Data Mining Practical Tools. Foundations, 2(9).

Big Data Analytics: R

R has proven to be the most effective software tool for analyzing big data. This post is a quick discussion of my evaluation of this tool and its relevance in the big data arena, with respect to text mining.

R is a powerful statistical tool that can aid in data mining.  Thus, it has huge relevance in the big data arena.  Focusing on my project, I have found that R has a text mining package, tm.

Patel and Donga (2015) and Fayyad, Piatetsky-Shapiro, and Smyth (1996) say that the main techniques in data mining are: anomaly detection (outlier/change/deviation detection), association rule learning (relationships between the variables), clustering (grouping data that are similar to one another), classification (applying a known structure to new data), regression (finding a function to describe the data), and summarization (visualizations, reports, dashboards). According to Ghosh, Roy, and Bandyopadhyay (2012), the main types of text mining techniques are: text categorization (assigning text/documents to pre-defined categories), text clustering (grouping similar text/documents together), concept mining (discovering concept/logic-based ideas), information retrieval (finding the documents relevant to a query), and information extraction (identifying key phrases and relationships within the text). Meanwhile, Agrawal and Batra (2013) add summarization (a compressed representation of the input), assessing document similarity (similarities between different documents), and document retrieval (identifying and grabbing the most relevant documents) to the list of text mining techniques.

We use the tm package, loaded with “library(tm)”, to transform text, stem words, build a term-document matrix, etc., mostly for preprocessing the data (RStudio pubs, n.d.). Based on RStudio pubs (n.d.), some text preprocessing steps and code are as follows:

  • To remove punctuation:

docs <- tm_map(docs, removePunctuation)

  • To remove special characters:

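# Assumes tm >= 0.6, where custom transformations are wrapped in content_transformer()
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")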

  • To remove numbers:

docs <- tm_map(docs, removeNumbers)

  • Convert to lowercase:

docs <- tm_map(docs, content_transformer(tolower))  # wrapper keeps each element a tm text document

  • Removing “stopwords”/common words

docs <- tm_map(docs, removeWords, stopwords("english"))

  • Removing particular words

docs <- tm_map(docs, removeWords, c("department", "email"))

  • Combining words that should stay together

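# Replace multi-word phrases with a single token so they are counted together
combine <- content_transformer(function(x, from, to) gsub(from, to, x))
docs <- tm_map(docs, combine, "qualitative research", "QDA")
docs <- tm_map(docs, combine, "qualitative studies", "QDA")
docs <- tm_map(docs, combine, "qualitative analysis", "QDA")
docs <- tm_map(docs, combine, "research methods", "research_methods")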

  • Removing common word endings (stemming):

library(SnowballC)
docs <- tm_map(docs, stemDocument)
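To inspect what these preprocessing steps do to actual text, the sketch below mirrors most of them in base R on two made-up sentences, with no dependency on tm. The function name clean_text, the toy documents, and the tiny stopword list are all hypothetical; stemming is left out because it needs SnowballC.

```r
# Toy documents, invented for illustration.
docs_raw <- c(
  "Email the Department: 3 qualitative studies @ 2pm!",
  "Qualitative research/analysis of 12 emails."
)

# Hypothetical mini stopword list; tm's stopwords("english") is far longer.
stop_words <- c("the", "of", "a", "an", "and")

clean_text <- function(x, stop_words) {
  x <- gsub("[/@|]", " ", x)          # special characters become spaces
  x <- gsub("[[:punct:]]+", "", x)    # drop remaining punctuation
  x <- gsub("[[:digit:]]+", "", x)    # drop numbers
  x <- tolower(x)                     # convert to lowercase
  # combine phrases that should stay together
  x <- gsub("qualitative (research|studies|analysis)", "QDA", x)
  # remove stopwords as whole words only
  stop_pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(stop_pattern, "", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x))        # collapse leftover whitespace
}

cleaned <- clean_text(docs_raw, stop_words)
print(cleaned)
```

One design choice in this sketch: special characters are turned into spaces before punctuation is stripped. If punctuation were removed first, “research/analysis” would fuse into a single word instead of splitting into two.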

Text mining algorithms could consist of but are not limited to (Zhao, 2013):

  • Summarization:
    • Word clouds use “library (wordcloud)”
    • Word frequencies
  • Regressions
    • Term correlations use the function findAssocs from “library (tm)”
    • Plot word frequencies use “library (ggplot2)”
  • Classification models:
    • Decision Tree “library (party)” or “library (rpart)”
  • Association models:
    • Apriori use “library (arules)”
  • Clustering models:
    • K-means clustering use “library (fpc)”
    • K-medoids clustering use “library(fpc)”
    • Hierarchical clustering use “library(cluster)”
    • Density-based clustering use “library (fpc)”
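Two of the items above, word frequencies and term correlations, can be sketched in base R on a tiny document-term matrix. This is only an illustration of the underlying idea: the four documents are made up, and the correlation filter is similar in spirit to what findAssocs reports rather than a reimplementation of it.

```r
# Toy, already-preprocessed documents (invented for this sketch).
docs <- c(
  "big data privacy policy",
  "big data analytics tools",
  "privacy policy report",
  "text mining tools"
)

# Build a document-term count matrix by hand.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Word frequencies: total count of each term across all documents.
freq <- sort(colSums(dtm), decreasing = TRUE)

# Term correlations: which terms co-occur with "privacy" across documents?
assoc <- cor(dtm)["privacy", ]
assoc <- assoc[names(assoc) != "privacy"]
strong <- assoc[assoc >= 0.8]   # 0.8 is an arbitrary cut-off for this sketch

print(freq)
print(strong)
```

Here “policy” appears in exactly the same documents as “privacy”, so it is the only term that passes the cut-off. On a real corpus the same frequency vector is what a word cloud or a bar chart of word frequencies would be drawn from.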

As we can see, there are current libraries, functions, etc. to help with data preprocessing, data mining, and data visualization when it comes to text mining with R and RStudio.

Resources: