Big Data Analytics: Career Prospects

There is a wealth of employment opportunities for professionals with knowledge of big data analytics. This post explores what employment opportunities currently exist for people graduating with a concentration in data analytics.


Master’s and doctoral graduates have some advantages over undergraduates: they have done research or capstones involving big datasets; they can explain the motivations and reasoning behind the work (chapters 1 and 2 of the dissertation); they can learn and adapt quickly (chapter 3 reflects what you have learned and how you will apply it); and they can think critically about problems (chapters 4 and 5 of the dissertation). Doctoral students work on a problem for months or years to reach a solution (filling a gap in the knowledge), and they would not dream of leaving that gap incomplete or unfilled. But to prepare best for a data science or big data position, the doctoral work shouldn’t be purely theoretical; it should include an analysis of huge datasets. Based on my personal analysis, I have noticed that when applying for a senior-level or team-lead position in data science, a doctorate counts as an additional three years of experience on top of what you already have. If you lack a doctorate, you need a master’s degree and three years of experience to be considered for those positions.

Master’s-level courses in big data help build strong mathematical, statistical, computational, and programming skills. Doctorate-level courses help you learn and push the limits of knowledge in all of these fields, and they also help you become a domain expert in a particular area of data science. Commanding that domain expertise, which is what you gain by going through a doctoral program, can make you more valuable in the job market (Lo, n.d.). Being more valuable in the job market can allow you to demand more in compensation. Different sources quote different salary ranges, mostly because this field has yet to be standardized (Lo, n.d.). Thus, I provide only two sources for salary ranges.

According to Columbus (2014), jobs that involve big data, such as Big Data Solution Architect, Linux Systems and Big Data Engineer, Big Data Platform Engineer, and Lead Software Engineer, Big Data (Java, Hadoop, SQL), have the following salary statistics:

  • Q1: $84,650
  • Median: $103,000
  • Q3: $121,300

Columbus (2014) also stated that it is very difficult to find the right people for an open requisition and that most requisitions remain open for 47 days. According to Columbus (2014), the most wanted skills for analytics data jobs, based on requisition postings in the field, are Python (96.90% growth in demand in the past year) and Linux and Hadoop (76% growth in demand each).

Lo (n.d.) states that individuals with just a BS or MS degree and no full-time work experience should expect $50-75K, whereas data scientists with experience can command $65-110K.

  • Data scientists can earn $85-170K
  • Data science/analytics managers can earn $90-140K for 1-3 direct reports
  • Data science/analytics managers can earn $130-175K for 4-9 direct reports
  • Data science/analytics managers can earn $160-240K for 10+ direct reports
  • Database administrators can earn $50-120K, increasing with experience
  • Junior big data engineers can earn $79-115K
  • Domain expert big data engineers can earn $100-165K

One way to look for currently available opportunities in the field is to consult Gartner’s Magic Quadrant for Business Intelligence and Analytics Platforms (Parenteau et al., 2016). If, as a data scientist, you want to help push a tool toward a higher ability to execute and completeness of vision, consider employment at: Pyramid Analytics, Yellowfin, Platfora, Datawatch, Information Builders, Sisense, Board International, Salesforce, GoodData, Domo, Birst, SAS, Alteryx, SAP, MicroStrategy, Logi Analytics, IBM, ClearStory Data, Pentaho, TIBCO Software, BeyondCore, Qlik, Microsoft, and Tableau. That is one way to look at this data. Another way is to see which tools are the best in the field (Tableau, Qlik, and Microsoft, with SAS, Birst, Alteryx, and SAP following behind) and learn those tools to become more marketable.


Big Data Analytics: POTUS Report

This has become a data-centric society, relying on real-time data and technology (e.g., cell phones, online shopping, social networking) more than ever. Although there are many advantages associated with the use of this data, there are concerns that the collection of massive amounts of data can lead to an invasion of privacy. In January 2014, President Obama asked his staff to take the next 90 days to prepare a report for him on how big data is affecting people’s privacy. This post revolves around this report.

The aim of big data analytics is for data scientists to fuse data from various sources, of various types, and in huge amounts so that they can find relationships, identify patterns, and find anomalies. Big data analytics can help provide a descriptive, prescriptive, or predictive result for a specific research question. Big data analytics isn’t perfect: sometimes the results are not significant, and we must remember that correlation is not causation. Regardless, there are many benefits from big data analytics, and this is a field where policy has yet to catch up in order to protect the nation from potential downsides while still promoting and maximizing benefits.

Policies for maximizing benefits while minimizing risk in the public and private sectors

In the private sector, companies can create detailed personal profiles that enable personalized services from a company to a consumer. Interpreting personal profile data allows a company to retain and command more market share, but it can also leave room for discrimination in pricing, service quality/type, and opportunities through “filter bubbles” (Podesta, Pritzker, Moniz, Holdren, & Zients, 2014). Policy recommendations should encourage de-identifying personally identifiable information to the point that the data cannot be re-identified. Current policies promoting privacy in the private sector are (Podesta et al., 2014):

  • The Fair Credit Reporting Act helps promote fairness and privacy of credit and insurance information
  • The Health Insurance Portability and Accountability Act enables people to understand and control how personal health data is used
  • The Gramm-Leach-Bliley Act helps consumers of financial services keep their privacy
  • The Children’s Online Privacy Protection Act minimizes the collection and use of data from children under the age of 13
  • The Consumer Privacy Bill of Rights is a privacy blueprint that helps people understand what personal data is being collected and what it is used for, in ways consistent with their expectations

In the public sector, we run into issues when the government collects information about its citizens for one purpose and eventually uses that same data for a different purpose (Podesta et al., 2014). This gives the government the potential to eventually exert power over certain types of citizens and hamper civil rights progress in the future. Current policies in the public sector are (Podesta et al., 2014):

  • The Affordable Care Act allows for building a better health care system by moving from a “fee-for-service” program to a “fee-for-better-outcomes” program. This has allowed big data analytics to promote preventative care rather than emergency care, while limiting the use of that data to deny health care coverage for “pre-existing health conditions.”
  • The Family Educational Rights and Privacy Act, the Protection of Pupil Rights Amendment, and the Children’s Online Privacy Protection Act help seal children’s educational records to prevent misuse of that data.

Identifying opportunities for big data in the economy, health, education, safety, and energy efficiency

In the economy, the internet of things can equip parts of a product with sensors that monitor and transmit thousands of live data points for sending alerts. These alerts can tell us when maintenance is needed, for which part, and where it is located, saving time and improving overall safety (Podesta et al., 2014).

In medicine, predictive analytics could be used to identify instances of insurance fraud, waste, and abuse in real time, saving more than $115M per year (Podesta et al., 2014). Another use of big data is in neonatal intensive care, where current data helps create prescriptive results to determine which newborns are likely to contract which infection and what the outcome would be (Podesta et al., 2014). Monitoring a newborn’s heart rate and temperature along with other health indicators can alert doctors to the onset of an infection, to prevent it from getting out of hand. Huge genetic data sets are helping locate the genetic variants behind certain genetic diseases that were once hidden in our genetic code (Podesta et al., 2014).

With regard to national safety and foreign interests, data scientists and data visualizers have been using data gathered by the military to help commanders solve real operational challenges on the battlefield (Podesta et al., 2014). Big data analytics on satellite data, surveillance data, and road traffic flow data is making it easier to detect, obtain, and properly dispose of improvised explosive devices (IEDs). The Department of Homeland Security aims to use big data analytics to identify threats as they enter the country and people with a higher-than-normal probability of committing acts of violence within the country (Podesta et al., 2014). Another safety-related use of big data analytics is the identification of human trafficking networks through analysis of the “deep web” (Podesta et al., 2014).

Finally, for energy efficiency, understanding weather patterns and climate change can help us understand our contribution to climate change based on our use of energy and natural resources. By analyzing traffic data, we can improve energy efficiency and public safety in our current lighting infrastructure by dimming lights at appropriate times (Podesta et al., 2014). Companies can maximize energy efficiency by using big data analytics to control their direct and indirect energy uses (through optimizing supply chains and monitoring equipment). Another way we are moving toward a more energy-efficient future is through the government partnering with electric utility companies to give businesses and families access to their own energy usage data in an easy-to-digest form, so that people and companies can change their current consumption levels (Podesta et al., 2014).

Protecting your own privacy outside of policy recommendation

The report suggests that we can control our own privacy by using the private browsing function found in most current internet browsers, which helps prevent the collection of personal data (Podesta et al., 2014). However, private browsing varies from browser to browser. For important decisions like being denied employment, credit, or insurance, consumers should be empowered to know why they were denied and should ask for that information (Podesta et al., 2014). Finding out the reason allows people to address those issues and persevere in the future. We can also encrypt our communications, with the highest-bit encryption available, to protect our privacy. We need to educate ourselves on how to protect our personal data, build digital literacy, and know how big data can be used and abused (Podesta et al., 2014). While we wait for current policies to catch up with the times, we actually have more power over our own data and privacy than we realize.

Reference:

Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities, Preserving Values. Executive Office of the President. Retrieved from https://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf

Big Data Analytics: Crime Fighting

Big data analytics can have a profound effect on the success of a business, and several case studies document this success. This post introduces and examines one such case study, explains the key problems that needed to be resolved, and identifies the key components that led to the case’s success.

Case Study: Miami-Dade Police Department: New patterns offer breakthroughs for cold cases. 

Introduction:

Tourism is key to South Florida, bringing in $20B per year in a county of 2.5M people. Robbery and the rise of other street crimes can hurt tourism and one-third of the state’s sales tax revenue. Thus, Lt. Arnold Palmer from the Robbery Investigation Police Department of Miami-Dade County teamed up with IT Services Bureau staff and IBM specialists to develop Blue PALMS (Predictive Analytics Lead Modeling Software), to help fight crime and protect the citizens of and tourists to Miami-Dade County. When tested on 40 solved cases, the tool achieved a 73% success rate. The tool was developed on the premise that most crimes are committed by people who have committed previous crimes.

 Key Problems:

  1. Cold cases needed to be solved and finally closed. Even with established methods (mostly people skills and evidence gathering), patterns could still be missed by even the most experienced officers.
  2. Other crimes, like robbery, happen in predictable patterns (time of day and location), which is explicit knowledge among the force. So a tool doesn’t need to tell them the location and time of the next crime; the police need to know who did it, so a narrowed-down list of likely suspects would help.
  3. The more experienced police officers are retiring, and their experience and knowledge leave with them. Thus, the tool must allow junior officers to ask it the same questions they would ask experienced officers and get the same answers. Fortunately, the opportunity here is that newer officers come in embracing technology whenever they can, whereas veteran officers tread lightly when it comes to embracing technology.

Key Components to Success:

It comes down to buy-in. Lt. Palmer had to nurture top-down support as well as buy-in from the bottom up (the ranks). It was much harder to get buy-in from more experienced detectives, who felt that the introduction of tools like analytics was a way to tell them to give up their long-standing practices, or even to replace them. So Lt. Palmer sold Blue PALMS this way: “What’s worked best for us is proving [the value of Blue PALMS] one case at a time, and stressing that it’s a tool, that it’s a complement to their skills and experience, not a substitute.” Lt. Palmer got buy-in from a senior and well-respected officer by helping him solve a case. The senior officer had a suspect in mind, and after the data was fed in, the tool produced a list of 20 people who could have done it, ranked from most to least likely. The suspect was in the top five, and when apprehended, the suspect confessed. Doing this case by case has built trust among veteran officers and eventually earned their buy-in.

 Similar organizations could benefit:

Other county police departments in Florida that have data collection measures similar to those of the Miami-Dade County Police Department would be a quick win (a short-term plan) for tool adoption. Eventually, other police departments in Florida and in other states can start adopting the tool, after more successes have been defined and shared by fellow police officers. Police officers have a brotherhood mentality, and as acceptance of this tool grows, it will eventually reach critical mass, and adoption will come much more quickly than it does today. Other organizations similar to police departments that could benefit from this tool are fire departments, other emergency responders, the FBI, and the CIA.


Big Data Analytics: Open-Sourced Tools

It is critical that big data analysts are provided with software tools to effectively synthesize large data sets. Various software tools for analyzing big data have grown in popularity over the past few years. This post compares and contrasts three software tools that I found most effective in analyzing big data, including the advantages and disadvantages of each application.

Here are three open source text mining software tools for analyzing unstructured big data:

  1. Carrot2
  2. Weka
  3. Apache OpenNLP.

One of the great things about these three software tools is that they are free, so there is no cost for any of these solutions.

 Carrot2

Carrot2 is Java-based software that also offers native integration with PHP and a C#/.NET API (Gonzalez-Aguilar & Ramirez-Posada, 2012). Carrot2 can organize a collection of documents into categories based on themes in a visual manner; it can also be used as a web clustering engine. Carpineto, Osinski, Romano, and Weiss (2009) stated that web clustering search engines like Carrot2 help with fast subtopic retrieval (e.g., searching for “tiger” can return Tiger Woods, tigers, Bengals, the Bengals football team, etc.), topic exploration (through a cluster hierarchy), and alleviation of information overlook (going beyond the first page of search results). The algorithms it uses for categorization are Lingo (Lingo3G), k-means, and STC, which can support multiple-language clustering, synonyms, etc. (Carrot, n.d.). This software can also be used online instead of regular search engines (Gonzalez-Aguilar & Ramirez-Posada, 2012). Gonzalez-Aguilar and Ramirez-Posada (2012) explain that the interface has three phases for processing information: entry, filtration, and exit. It represents the cluster data in three visual formats: heatmap, network, and pie chart.

The disadvantage of this tool is that it only does clustering analysis, but its advantage is that it can be applied to a search engine to facilitate faster and more accurate searches through its subtopic analysis.  If you would like to use Carrot2 as a search engine, go to http://search.carrot2.org/stable/search and try it out.

Weka

Weka was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga, 2015). It is a Java-based collection of machine learning algorithms for data mining and text mining tasks that can be used for predictive modeling, with tools for pre-processing, classification, regression, clustering, association rules, and visualization (Weka, n.d.). Weka can be applied to big data (Weka, n.d.) and SQL databases (Patel & Donga, 2015).

A disadvantage of this tool is its lack of support for multi-relational data mining, but if you can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). Its advantage is the comprehensiveness of its analysis algorithms for data mining, text mining, and pre-processing.

 Apache OpenNLP

Apache OpenNLP is a Java-based machine learning toolkit that handles tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution (OpenNLP, n.d.). OpenNLP works well with the NetBeans and Eclipse IDEs, which helps in the development process. This tool has dependencies on Maven, UIMA Annotators, and SNAPSHOT.

The advantage of OpenNLP is that rules, constraints, and lexicons don’t need to be specified manually; instead, it uses a machine learning method that aims to maximize entropy (Buyko, Wermter, Poprat, & Hahn, 2006). Maximizing entropy allows facts to be collected consistently and uniformly. When the sentence splitter, tokenization, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution were tested on two medical corpora, accuracy was in the high 90%s (Buyko et al., 2006).

This software has high accuracy as its advantage, but it also produces quite a few false negatives, which is its disadvantage. In the sentence splitter function, it picked up literature citations, and in tokenization, it picked up the special characters “-” and “/” (Buyko et al., 2006).

 References:

  • Buyko, E., Wermter, J., Poprat, M., & Hahn, U. (2006). Automatically adapting an NLP core engine to the biology domain. In Proceedings of the Joint BioLINK-Bio-Ontologies Meeting: A Joint Meeting of the ISMB Special Interest Group on Bio-Ontologies and the BioLINK Special Interest Group on Text Data Mining in Association with ISMB (pp. 65-68).
  • Carpineto, C., Osinski, S., Romano, G., & Weiss, D. (2009). A survey of web clustering engines. ACM Computing Surveys, 41(3), Article 17, 38 pages. DOI = 10.1145/1541880.1541884 http://doi.acm.org/10.1145/1541880.1541884
  • Carrot (n.d.). Open source framework for building search clustering engines. Retrieved from http://project.carrot2.org/index.html
  • Gonzalez-Aguilar, A., & Ramirez-Posada, M. (2012). Carrot2: Búsqueda y visualización de la información (in Spanish). El Profesional de la Informacion. Retrieved from http://project.carrot2.org/publications/gonzales-ramirez-2012.pdf
  • OpenNLP (n.d.). The Apache Software Foundation: OpenNLP. Retrieved from https://opennlp.apache.org/
  • Patel, K., & Donga, J. (2015). Practical Approaches: A Survey on Data Mining Practical Tools. Foundations, 2(9).
  • Weka (n.d.). Weka 3: Data Mining Software in Java. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/

Big Data Analytics: R

R has proven to be the most effective software tool for analyzing big data. This post is a quick discussion of my evaluation of this tool and its relevance in the big data arena, with respect to text mining.

R is a powerful statistical tool that can aid in data mining. Thus, it has huge relevance in the big data arena. Focusing on my project, I have found that R has a text mining package, tm.

Patel and Donga (2015) and Fayyad, Piatetsky-Shapiro, and Smyth (1996) say that the main techniques in data mining are: anomaly detection (outlier/change/deviation detection), association rule learning (relationships between the variables), clustering (grouping data that are similar to one another), classification (applying a known structure to new data), regression (finding a function to describe the data), and summarization (visualizations, reports, dashboards). According to Ghosh, Roy, and Bandyopadhyay (2012), the main types of text mining techniques are: text categorization (assigning text/documents to pre-defined categories), text clustering (grouping similar text/documents together), concept mining (discovering concept/logic-based ideas), information retrieval (finding the relevant documents for a query), and information extraction (identifying key phrases and relationships within the text). Meanwhile, Agrawal and Batra (2013) add summarization (a compressed representation of the input), document similarity assessment (similarities between different documents), and document retrieval (identifying and grabbing the most relevant documents) to the list of text mining techniques.

We use “library(tm)” to help transform text, stem words, build a term-document matrix, etc., mostly for preprocessing the data (RStudio pubs, n.d.). Based on RStudio pubs (n.d.), some text preprocessing steps and code are as follows:

  • To remove punctuation:

docs <- tm_map(docs, removePunctuation)

  • To remove special characters:

for (j in seq(docs)) {
  docs[[j]] <- gsub("/", " ", docs[[j]])
  docs[[j]] <- gsub("@", " ", docs[[j]])
  docs[[j]] <- gsub("\\|", " ", docs[[j]])
}

  • To remove numbers:

docs <- tm_map(docs, removeNumbers)

  • Convert to lowercase:

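# content_transformer() keeps each document a tm text document (needed in tm 0.6 and later)
docs <- tm_map(docs, content_transformer(tolower))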

  • Removing “stopwords”/common words

docs <- tm_map(docs, removeWords, stopwords("english"))

  • Removing particular words

docs <- tm_map(docs, removeWords, c("department", "email"))

  • Combining words that should stay together

for (j in seq(docs)) {
  docs[[j]] <- gsub("qualitative research", "QDA", docs[[j]])
  docs[[j]] <- gsub("qualitative studies", "QDA", docs[[j]])
  docs[[j]] <- gsub("qualitative analysis", "QDA", docs[[j]])
  docs[[j]] <- gsub("research methods", "research_methods", docs[[j]])
}

  • Removing common word endings

library(SnowballC)
docs <- tm_map(docs, stemDocument)

Text mining algorithms could consist of but are not limited to (Zhao, 2013):

  • Summarization:
    • Word clouds use “library(wordcloud)”
    • Word frequencies
  • Regressions:
    • Term correlations use the findAssocs function from “library(tm)”
    • Plotting word frequencies uses “library(ggplot2)”
  • Classification models:
    • Decision trees use “library(party)” or “library(rpart)”
  • Association models:
    • Apriori uses “library(arules)”
  • Clustering models:
    • K-means clustering uses “library(fpc)”
    • K-medoids clustering uses “library(fpc)”
    • Hierarchical clustering uses “library(cluster)”
    • Density-based clustering uses “library(fpc)”

As we can see, there are existing libraries, functions, etc. to help with data preprocessing, data mining, and data visualization when it comes to text mining with R and RStudio.


Big Data Analytics: Installing R

This post gives instructions to download, install, and become familiar with R, a free, open-source big data analytics software tool.

I didn’t have any problems with the installation thanks to a video produced by Dr. Webb (2014). R is a bigger package than I thought it would be, so it can take a few minutes to download, depending on your download speed and internet connection. To install it:

(1)    For proper installation of R, you need to have administrative access on your computer.

(2)    Watch the video by Dr. Webb (2014) to get step-by-step instructions and a tutorial on installing R and its graphical Integrated Development Environment (IDE).

  1. Note: The 32-bit and 64-bit versions of R can be found at http://cran.r-project.org/
  2. Note: The free RStudio “Desktop” graphical IDE can be found at http://www.rstudio.com/

(3)    Once installed, use the manual for this application at this site: http://cran.r-project.org/doc/manuals/R-intro.html

Once I installed the software and the graphical IDE, I continued to follow along with the video to use the preloaded cars data under the “datasets” package, and I got the same result as shown in the video. I would also like to note that Dr. Webb (2014) had checked the packages “datasets,” “graphics,” “grDevices,” “methods,” and “stats” in the video, which can be hard to see depending on your video streaming resolution.

Resources:

Webb, J. (2014). Installing and Using the “R” Programming Language and RStudio. Retrieved from https://www.youtube.com/watch?v=77PgrZSHvws&feature=youtu.be

Big Data Analytics: Hadoop®

Hadoop® is used for distributed computing and can query large data sets based on its reliable and scalable architecture.
Two major components of Hadoop® are the Hadoop® Distributed File System (HDFS) and MapReduce. This post discusses the overall roles of these two components, including their role during system failures.

Hadoop® Distributed File System (HDFS):

In HDFS, big data is broken up into smaller blocks (IBM, n.d.), which can be aggregated like a set of Legos throughout a distributed database system. Data blocks are distributed across multiple servers. This block system provides an easy way to scale the company’s data needs up or down and allows MapReduce to do its tasks on smaller sets of the data for faster processing (IBM, n.d.). Blocks are small enough that they can be easily duplicated (for disaster recovery purposes) on two different servers (or more, depending on your data needs).

Example 1:

One way to picture data stored in HDFS is to think of a deck of cards, in which each card holds information about what it is: value, color, symbol, etc. HDFS can divide the data into blocks by A, 2, 3 … J, Q, & K, so each block holds the data for four cards. Thus, there are 13 distinct data blocks, which have been parsed by their value and placed on 13 different servers. Let’s also assume I need higher-than-average availability for some cards, so rather than two copies, I need four copies of the J, Q, & K blocks, and two copies of A, 2, 3 … 10. This is possible. The copies could be clustered on similar servers, or each could have a server of its own. This type of redundancy within HDFS gives my data higher availability. Thus, when I need to analyze the data on my deck of cards, the important values J, Q, & K have a higher chance of being available than my perceived lower-value cards A, 2 … 10.
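
The card-deck example can be sketched in a few lines of Python. This is only an illustration of the analogy, not how Hadoop itself is programmed: the server names and the round-robin placement rule are assumptions made up for the sketch, and real HDFS splits files into blocks by size rather than by card value.

```python
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
HIGH_VALUE_RANKS = {"J", "Q", "K"}


def build_blocks():
    """Split the 52-card deck into 13 blocks, one per rank, four cards each."""
    return {rank: [(rank, suit) for suit in SUITS] for rank in RANKS}


def replication_factor(rank):
    """Four copies for J, Q, K; two copies for every other rank (as in the example)."""
    return 4 if rank in HIGH_VALUE_RANKS else 2


def place_replicas(blocks, servers):
    """Assign each block's copies to distinct servers (assumed round-robin rule)."""
    placement = {}
    cursor = 0
    for rank in blocks:
        copies = replication_factor(rank)
        if copies > len(servers):
            raise ValueError("not enough servers to keep the copies on distinct servers")
        placement[rank] = [servers[(cursor + i) % len(servers)] for i in range(copies)]
        cursor += copies
    return placement


# Hypothetical server names, one per rank as in the example.
servers = [f"server-{n:02d}" for n in range(1, 14)]
blocks = build_blocks()
placement = place_replicas(blocks, servers)

for rank in ("A", "K"):
    print(rank, "is stored on", placement[rank])
```

Because every copy of a block sits on a different server, a block with four copies survives more simultaneous server failures than a block with two, which is the availability difference the example describes.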

MapReduce:

MapReduce contains two job types that work in parallel on distributed systems: (1) Mappers, which create and process transactions on the system by mapping/aggregating data by key values, and (2) Reducers, which know what the key value is, take all the values stored in a map for that key, and reduce the data to what is relevant (Hortonworks, 2013; Sathupadi, 2010). Reducers can work on different keys. Huge amounts of data are entered into MapReduce; the Mapper maps the data, and the data is then shuffled and sorted before it is reduced. Once the data is reduced, we get the output that we sought.

According to IBM (n.d.), MapReduce functions using HDFS run their procedures on the server where the data is stored (also known as data locality). Keeping in mind that HDFS has at least two backup copies, if one server goes down, which can happen, MapReduce can continue doing its tasks on the same data on a different server that is working. This backup system for disaster recovery allows for high data availability.
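
The failover idea can be sketched as follows. This is a minimal illustration under assumed names (the function and the server labels are hypothetical), not Hadoop's actual scheduling logic: a task asks for any working server that holds a copy of its block.

```python
def pick_live_server(replica_servers, down_servers):
    """Return the first server that holds a copy of the block and is still up."""
    for server in replica_servers:
        if server not in down_servers:
            return server
    raise RuntimeError("no live copy of this block is available")


# Hypothetical servers that each hold a copy of the same block.
replicas = ["server-03", "server-07"]

# All servers healthy: the task runs where the first copy lives.
print(pick_live_server(replicas, down_servers=set()))

# server-03 is down: the task moves to the other copy.
print(pick_live_server(replicas, down_servers={"server-03"}))
```

The task fails only when every server holding a copy is down at the same time, which is why keeping more copies raises availability.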

Example 2:

This example, adapted from Sathupadi (2010), looks at how MapReduce can calculate the sum of all Harvard law and medical students’ current outstanding school loans per degree type. Thus, the final output from our example would be the current outstanding school loan amounts for Juris Doctor (JD) students, Legum Magister (LLM) students, Doctor of Medicine (MD) students, and Doctor of Osteopathic Medicine (DO) students.

If I ran this in Hadoop, a single copy of the data could be spread across 50 servers, and thus 50 nodes could be used to process this transaction request in parallel, speeding it up significantly, but not by 50-fold. It is not 50-fold because it takes time to go from mapping to reducing and because nodes need to talk to each other, which slows down the transaction. So running on X nodes in parallel never really makes us X times faster; in reality, we are X-e times faster (where e is the transaction cost).
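
One way to make this concrete is a toy timing model. The model and every number in it are assumptions for illustration, not measurements: the work divides evenly across the nodes, and a fixed overhead is added for shuffling and node-to-node communication.

```python
def parallel_speedup(serial_seconds, nodes, overhead_seconds):
    """Speedup under an assumed model: even split of work plus a fixed overhead."""
    parallel_seconds = serial_seconds / nodes + overhead_seconds
    return serial_seconds / parallel_seconds


# Hypothetical job: 1000 s on one node, 50 nodes, 5 s of coordination overhead.
speedup = parallel_speedup(serial_seconds=1000, nodes=50, overhead_seconds=5)
lost_to_overhead = 50 - speedup  # plays the role of e in "X-e times faster"

print(f"speedup: {speedup:.1f}x, lost to overhead: {lost_to_overhead:.1f}")
```

With zero overhead the model gives the full 50-fold speedup; any coordination cost pulls the result below X.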

The bad data that gets thrown out in the mapper phase would be the undergraduate students, Doctor of Philosophy students, master’s degree students, etc. Only JD, LLM, MD, and DO students get a key assigned to them, and the keys are the same across all nodes, so that the sum of all current outstanding school loan amounts gets processed under the correct group. If the data is duplicated at least twice on different servers and a server goes down, the MapReduce function will move on to a copy of that data that can still be mapped and reduced.
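
The school-loan example can be sketched as a single-process Python program that imitates the map, shuffle/sort, and reduce phases. It does not use any Hadoop API, and the records and loan amounts are invented sample data.

```python
from collections import defaultdict

# Degree types that receive a key; every other record is dropped by the mapper.
PROFESSIONAL_DEGREES = {"JD", "LLM", "MD", "DO"}


def mapper(record):
    """Emit (degree, loan) for the degrees of interest; drop everything else."""
    degree, loan = record
    if degree in PROFESSIONAL_DEGREES:
        yield degree, loan


def shuffle(mapped_pairs):
    """Group mapped values by key and sort the keys, as in the shuffle/sort step."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reducer(key, values):
    """Reduce all loan amounts for one degree type to their sum."""
    return key, sum(values)


def map_reduce(records):
    mapped = (pair for record in records for pair in mapper(record))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


# Invented sample records: (degree type, outstanding loan amount in dollars).
records = [
    ("JD", 120000), ("MD", 200000), ("PhD", 30000), ("LLM", 60000),
    ("JD", 80000), ("BA", 15000), ("DO", 180000), ("MD", 150000),
]

totals = map_reduce(records)
print(totals)
```

In a real cluster, many mappers would run in parallel on different blocks and each reducer would handle its own subset of keys; the single-process version only shows how the data flows.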
