Adv DB: Web-DBMS

With 2.27 billion web users in 2012, we are talking about a significant portion of the world's population (Connolly & Begg, 2014), and HTTP provides the connection between the web server, the browser, and the data.  The internet is platform-independent, but GUIs are developed for end users to access the sites and information they seek.  A website can be static, presenting only company information (say, a webpage dedicated to Pizza Hut's vision statement), or dynamic, accepting user input and producing output (Domino's form to order and pay for a pizza online and have it delivered 30 minutes from that order).  The latter can be considered a Web-DBMS integration, which provides access to corporate data, connects to that data, can connect to a database independently of a web browser, is scalable, allows for growth, allows for changes based on permission levels, and so on (Connolly & Begg, 2014).

Advantages and disadvantages of Web-DBMS

According to Connolly & Begg (2014), the advantages could be:

  • Platform independence: Browsers like Chrome, Firefox, Edge, etc. can interpret basic HTML code and data connections without the code having to be modified for each operating system (OS)/platform that exists.
  • Graphical User Interface (GUI): Traditional access to databases is through command-line entries.  End users expect to see forms for inputting data, not SQL commands or command lines, and they expect to use search terms in their natural language rather than typing "Select * From …".  Meeting the end user's expectations and demands can mean higher profits.
  • Standardization: It allows a standard to be adopted on the back end and the front end (GUI), since this data set is now visible to the world.  Doing so makes it easier to connect to the data and write code.  Also, using HTML allows any machine with connectivity and a browser to access the GUI and the data.
  • Scalable deployment: Separating the server from the client allows for easy upgrading and management across multiple OS/platforms.
  • Cross-platform support: Safari is available for Macs, Edge is available for Windows, Firefox is available for all OSs, etc. These browsers provide support for the service they deliver, so coders only need to worry about interfacing with them via HTML.  Thus, connecting to a data source via HTML is possible with no need to write code for each OS/platform.

Whereas the disadvantages could be:

  • Security: SQL injection attacks become possible wherever free-text fields are accepted for form analysis, feedback, or comment gathering (a hedged sketch of the parameterized alternative appears just below this list).
  • Cost: The infrastructure is expensive.  Facebook, for example, collects user data, posts, images, videos, etc., so it is vital to have all that data backed up and readily available to everyone with the right permissions.  There is also the cost of staff to keep the database from being exploited by those not permitted to access the data.  On average, costs can vary from $300K to $3.4M depending on the organization's needs, market share, and the purpose of its site.
  • Limited functionality of HTML or HTML5: Transactions that can be done easily in SQL are much harder given what you can and cannot do in HTML and the code you can tie to it (JavaScript, PHP, etc.), which complicates the code overall. As the internet moves toward interactive and dynamic sites, capabilities could be added to HTML to make the back-end coding of a Web-DBMS easier and provide far more functionality.
  • Bandwidth: If millions of people access your services online all at once, as with Facebook, you must ensure that they can reach the data with high availability, consistency, and partition tolerance.
  • Performance: A 100 ms delay can significantly reduce an end user's retention and future repeat transactions (Abadi, 2012). Such delays can be brought on by bandwidth issues or by a security breach slowing down your resources, creating a high cost.  Also, this web interface makes access slower than connecting directly to the database through a traditional database client.

All of the disadvantages mentioned here are interconnected; one can cause another.
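To make the security point above concrete, here is a minimal sketch of a parameterized query, assuming a hypothetical Customers table and using SQL Server's sp_executesql as one example of parameter binding (any parameterized API in ASP/.NET, PHP, etc. follows the same idea):

```sql
-- Illustrative only: @user_input stands in for raw text from a web form.
DECLARE @user_input nvarchar(50) = N'Smith';

-- Vulnerable pattern (shown commented out): concatenating the input into the
-- SQL string lets a crafted value change the meaning of the statement.
-- EXEC (N'SELECT CustomerID FROM Customers WHERE LastName = ''' + @user_input + N'''');

-- Safer pattern: the input is bound as a typed parameter, so it is only ever
-- treated as a value to compare against, never as SQL to execute.
EXEC sp_executesql
     N'SELECT CustomerID, LastName FROM Customers WHERE LastName = @name',
     N'@name nvarchar(50)',
     @name = @user_input;
```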

The best approach to integrating the Web and a DBMS

Let's take Domino's Pizza's online ordering service.  They could use the Microsoft Web Solution Platform (.NET, ASP, and ADO) to connect to their data sources, and it is database-platform independent (SQL Server, Oracle, etc.).  ASP, for example, is a programming model for dynamic end-user interaction, while .NET adds many more tools, services, and technologies for further end-user interaction with data, databases, and websites (Connolly & Begg, 2014).  Using a web solution platform gives Domino's a means of receiving inputs through a form, writing them to their ordering database (which can later support better inventory decision analysis), and taking credit card payments online.  Once all that data has been taken in, an image/task-completion bar can show the customer where their order currently is in the process.  They can also use cookies to save end-user preferences for future orders, speeding up the user's interaction with the page.
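As a rough sketch of the database side of that flow (the Orders table, its columns, and the PlaceOrder procedure are all hypothetical), the web form's fields could be passed into a stored procedure so the page code never assembles SQL strings itself:

```sql
-- Hypothetical ordering table; the ASP/.NET page passes the form fields
-- in as parameters rather than building SQL strings itself.
CREATE TABLE Orders (
    OrderID    int IDENTITY(1,1) PRIMARY KEY,
    CustomerID int          NOT NULL,
    PizzaSize  varchar(10)  NOT NULL,
    Toppings   varchar(200) NULL,
    OrderTime  datetime     NOT NULL DEFAULT (GETDATE())
);
GO

CREATE PROCEDURE PlaceOrder
    @CustomerID int,
    @PizzaSize  varchar(10),
    @Toppings   varchar(200)
AS
BEGIN
    INSERT INTO Orders (CustomerID, PizzaSize, Toppings)
    VALUES (@CustomerID, @PizzaSize, @Toppings);
END;
GO

-- Example call the web tier might make once the form is submitted:
EXEC PlaceOrder @CustomerID = 42, @PizzaSize = 'large', @Toppings = 'pepperoni';
```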

Resources

  • Abadi, D. J. (2012). Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. IEEE Computer Society, (2), 37-42.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf version]. Retrieved from http://online.vitalsource.com/books/9781323135761/epubcfi/6/2

Adv DB: Indexes for query optimization

Information sought in a database can be extracted through a query.  However, the bigger the database, the longer a query takes to process; hence, query optimization techniques are applied.  Complex query operations are another reason to optimize.

Why you rarely see an index applied to every column in every table

Using indices for query optimization is like using the index at the back of a book to help you find the information or topic you need quickly. You could always scan all the tables, just as you could read the entire book, but that is not efficient (Nevarez, 2010).  You can use an index seek (ProductID = 77) or force an index scan by adding an operand (ABS(ProductID) = 77), though a scan takes up more resources than a seek.  You can also combine them (ProductID = 77 AND ABS(SalesOrderID) = 12345), where the engine seeks on ProductID and scans for SalesOrderID.  Indexing is an effective way to optimize a query, alongside other methods like applying heuristic rules or ordering the query operations for efficient use of resources (Connolly & Begg, 2014).  However, indices that are not used are of no use to us; they take up space on the system (Nevarez, 2010), which can slow down operations, so they should be removed.  That is why indexing shouldn't be applied to every column in every table.  Indexing every column may also be unnecessary depending on the size of the table: it is not needed for a 3 × 4 table but may be needed for a 30,000 × 12 one.
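As a hedged sketch of the seek-versus-scan distinction (the SalesOrderDetail table and index name here are illustrative, in the spirit of Nevarez's examples):

```sql
-- Illustrative index on the column used in the predicates below.
CREATE NONCLUSTERED INDEX IX_SalesOrderDetail_ProductID
    ON SalesOrderDetail (ProductID);

-- Index seek: the predicate matches the index key directly.
SELECT SalesOrderID, ProductID
FROM   SalesOrderDetail
WHERE  ProductID = 77;

-- Index scan: wrapping the key column in a function hides it from the
-- optimizer, so every index entry is read and filtered afterwards.
SELECT SalesOrderID, ProductID
FROM   SalesOrderDetail
WHERE  ABS(ProductID) = 77;

-- Combined: seek on ProductID, then apply the residual ABS(...) predicate.
SELECT SalesOrderID, ProductID
FROM   SalesOrderDetail
WHERE  ProductID = 77 AND ABS(SalesOrderID) = 12345;
```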

Thoughts on how best to manage data files in a database management system (DBMS)

Never assume; verify any changes you make with cold, hard data. When considering how best to manage a database, one must first learn whether the data files, or the data within the database, are dynamic (users create, insert, update, and delete regularly) or static (changes are minimal to non-existent) (Connolly & Begg, 2014).  Database administrators need to know when to fine-tune their databases with useful indices on heavily used tables and when to drop the indices that are not used at all.  Dropping unused indices saves space, speeds up update operations, and improves resource utilization (Nevarez, 2010). Knowing this helps us understand the nature of the database's users. We can then rewrite queries so they are optimized: ordering operations correctly, removing unnecessary loops in favor of joins, using joins, right joins, and left joins properly, avoiding the wildcard (*) and calling only the data we need, and ensuring proper use of internal temporary tables (those created on the server while querying).  Also, when timing queries, test the first run against itself and avoid accidentally including results served from the cache in the timing. Caching your results, and using the system cache when processing queries, is ideal.  A disadvantage of creating too many tables in the same database is slower interaction times, so creating multiple databases with fewer tables (as best logic permits) may be a great way to help with caching your results (MySQL 5.5 Manual, 2004).
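As one small, hypothetical example of such a rewrite (the Orders and Customers tables and their columns are made up), a wildcard select feeding an application-side lookup loop can be replaced with a single join that returns only the needed columns:

```sql
-- Before (conceptually): SELECT * FROM Orders; then, for each row, a separate
-- lookup query against Customers issued from the application code.

-- After: one join, returning only the columns actually needed.
SELECT o.OrderID,
       o.OrderDate,
       c.CustomerName
FROM   Orders AS o
       INNER JOIN Customers AS c
           ON c.CustomerID = o.CustomerID
WHERE  o.OrderDate >= '2016-01-01';
```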

Resources

Adv DB: Transaction management and concurrency control

Transaction management, in a nutshell, is keeping track of (serializing or scheduling) the changes made to a database.  An overly simplistic example is debiting and crediting $100 and $110, respectively.  If the account balance is currently $90, the order of these transactions is vital to avoid overdraft fees.  Concurrency control, in turn, is used to ensure data integrity when transactions occur, which makes the two interconnected.  Thus, in our example, serializing the transactions (performing all actions consecutively) is key: you want to add the $110 first so you have $200 in the account before debiting $100.  To do this you need timestamp ordering/serialization.  This became a terrible issue back in 2010 and was still an issue in 2014 (Kristof), when a survey of 44 major banks found that half still reorder transactions, which can drain account balances and cause overdraft fees.  Banks usually get around this by setting processing times for deposits that are typically longer than the processing times for charges.  Thus, even when transactions are handled serially and correctly, the processing time per transaction can vary so significantly that these issues still happen.  According to Kristof (2014), banks say they do this to process payments in order of priority.
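A minimal sketch of that serialized schedule (the Accounts table and its columns are hypothetical): the credit is applied before the debit inside one transaction, so the balance never drops below zero partway through.

```sql
BEGIN TRANSACTION;

-- Credit the $110 deposit first: $90 + $110 = $200.
UPDATE Accounts
SET    Balance = Balance + 110
WHERE  AccountID = 1;

-- Then apply the $100 debit: $200 - $100 = $100, no overdraft.
UPDATE Accounts
SET    Balance = Balance - 100
WHERE  AccountID = 1;

COMMIT TRANSACTION;
```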

The case above illustrates why an optimistic concurrency control method is not helpful here.  Optimistic methods do not check for serializability while transactions are initially executing (which keeps resource costs low); instead, transactions are done locally and validated against a serializable order before being finalized.  If we started the month paying a bunch of bills, realized we were close to $0, deposited $110, and then continued paying bills to the sum of $100, the validation and rollback work can eat up a lot of processing time, and it gets complicated quickly.  Conservative concurrency control has the fewest aborts and eliminates wasted processing by doing things serially, but you cannot run transactions in parallel.
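One common way to sketch the optimistic validation step (the RowVersion column and all names here are illustrative, not any particular product's mechanism) is to record a version number when the local work starts and commit the change only if that version is still unchanged:

```sql
-- Assume RowVersion was read as 7 when the local (unvalidated) work began.
DECLARE @version_at_start int = 7;

UPDATE Accounts
SET    Balance    = Balance - 100,
       RowVersion = RowVersion + 1
WHERE  AccountID  = 1
  AND  RowVersion = @version_at_start;   -- only if nobody changed it meanwhile

-- If no rows were affected, another transaction got there first:
-- the validation fails and the local work must be redone.
IF @@ROWCOUNT = 0
    PRINT 'Conflict detected: retry the transaction.';
```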

Huge amounts of incoming data, like those from the Internet of Things (where databases need to be flexible and extensible because a projected trillion different items will be producing data), would benefit greatly from optimistic concurrency control.  Take the example of a Fitbit, Apple Watch, or Microsoft Band: it records data on you throughout the day.  Because this massive data stream is time-stamped and heterogeneous, it doesn't matter whether the sleep data and the walking data are processed in parallel, as long as everything is validated in the end.  This allows for faster uploads over Bluetooth and/or WiFi.  Data can be actively extracted and explained in real time, but when there are many sensors on the device, the data and sensors all have different reasoning rules and semantic links between data, where existing or deductive links between sources exist (Sun & Jara, 2014), and that is where the true meaning of the generated data lies.  Sun & Jara suggest that a solid mathematical basis will help ensure a correct and efficient data storage system and model.

Resources

Words matter: Customize to configure

Let’s look at some definitions when it comes to software development and the nuances each one plays:

Customize: to modify or supplement with internally developed code to match end-user requests; the changes may not be preserved during an upgrade. This could be analogous to hacking a game like Pokemon Go to let end users spoof their locations and obtain regional-exclusive pocket monsters.

Tailoring: modifying or supplementing without code to fit a system into an environment.  Analogous to downloading Pokemon Go from the Google Play store or Apple App Store, where the right version of the app is downloaded for the right environment.

Personalization: meeting customers' needs effectively and efficiently.  This is achieved by analyzing customer data and using predictive analytics.  A great example is using the Active Sync tool to encourage Pokemon Go players to be more active, recognizing that there are three tiers of active players and personalizing the rewards based on the tier achieved.  Personalization can also be seen in character customization, clothing, poses, and buddy pokemon.

Configure: the process of setting up options and features, tailored to meet the implementation of business requirements.  In Pokemon Go, some people want to complete the pokedex, some like the gym system, some like 1:1 battles, 1:1 trades, side quests, beating the villains, etc. You can configure your goals in the game by pursuing one or all of these, to whatever extent you want, meeting your own requirements for satisfaction in playing the game.

Now if we want to think of these concepts on a continuum:

Customize <——- Tailoring ——- Personalization ——-> Configuring

where the cost of complexity decreases from right to left, constriction in growth decreases from right to left, and profit margin decreases from right to left.

The question now becomes: is the additional complexity on this spectrum worth the extra cost incurred?

That is for you to decide.


Literature reviews

Side Note: This particular post was on my to-do list for a long time.

A literature review is a process of deep consideration of the current literature, to aid in identifying the gaps in existing knowledge as well as building up the context for your research project (Gall, Gall, & Borg, 2006).  The literature review helps the researcher build upon the work of other researchers for the purpose of contributing to the collective knowledge. Our goal in the literature review will be undermined if we commit any of the following common flaws (Gall et al., 2006):

  1. A literature review that becomes a standalone piece in the final document
  2. Analyzing results from studies that are not sound in their methodology
  3. Not reporting the search procedures used to create the literature review
  4. Having only one study on particular ideas in the review, which may suggest the idea is not mature enough

For a literature review, one should learn their field by reviewing the collective knowledge in it, studying:

  • The beginning of {your topic}
  • The essence of {your topic}
  • Historical overview {your topic}
  • Politics of {your topic}
  • The Technology of {your topic}
  • Leaders in {your topic}
  • Current literature findings of {your topic}
  • Overview of research techniques {your topic}
  • The 21st century {your topic} Strategy

Creswell (2014) proposed that a literature map (similar to a mind map) of the research is a useful way to organize the literature, identify ideas supported by only a small number of sources, determine the current issues in the existing knowledge, and determine the reviewer's current gaps in their understanding of the existing knowledge.  Finally, Creswell (2014) listed what a good outline for a quantitative literature review should have:

  1. Introduction paragraph
  2. Review of topic one, which contains the independent variable(s).
  3. Review of topic two, which contains the dependent variable(s).
  4. Review of topic three, which provides how the independent variable(s) relate to the dependent variable(s).
  5. Summarize with highlights of key studies/major themes, to state why more research is needed.

Creswell's method is generally a good one, but not the only one.  You can use a chronological literature review, where you build your story from the beginning to the present. In my dissertation, my literature review had to tie multiple topics into one: big data, financial forecasting, and hurricane forecasting.  I had to use the diffusion of innovations theory to transition between financial and hurricane forecasting, to make the leap and justify the methodologies I would use later on.  In the end, you are the one who will be writing your literature review, and the more of them you read, the easier it will be to define how you should write yours.

Here is a little gem I found during my second year of my dissertation: Dr. Guy White (2014), in the following YouTube video, describes a way to effectively and practically build your literature review. I use this technique all the time.  All of my friends who have seen this video have loved this method of putting together their literature reviews.

References

Internal validity in qualitative studies

Internal validity is about determining the accuracy of the findings in qualitative research from the viewpoints of the researcher, the participants, or the reader (Creswell, 2013). There are many validity strategies, such as: triangulation of different data sources, member checking, rich, thick description of the findings, clarifying any bias, presenting negative or discrepant information, prolonging time in the field, peer debriefing, having an external auditor review the project, etc.

Triangulation of different data sources for observational work means examining evidence from multiple sources of data to justify the themes created through coding.  Converging themes from multiple sources of data and/or participant perspectives adds to the validity of the study.  Thus, one way to increase the validity of the thematic codes is to present codes drawn from the analysis of multiple sources, such as:

  • Interviews from N number of participants (until data saturation is reached)
  • Observations of the participants
    • Repeated observations will be taken during multiple types of shifts, with or without the same participants, and on different random days of the week over a one-month period.
    • Observational Goal 1: Tracking what information is used (type, time stamps, instrumentation, etc.)
    • Observational Goal 2: Through videotaping, I hope to track conversations between participants sharing the same shift. Field notes would contain: "Why was the conversation initiated?", "What was discussed?", "Were decisions made regarding the area of study?", "What bodily-based behavior was portrayed by the specialists in the discussion?", and "What was the outcome of that discussion?"
  • Document Analysis

The aforementioned, in particular, will help ensure internal validity in quite a few studies.

References:

Ethical issues involving human subjects

In Creswell (2013), it is stated that ethical issues can occur at all phases of a study (prior to the study, at the beginning, during data collection, during analysis, and in reporting).  Since we deal with data from people, about people, we as researchers need to protect our participants and promote the integrity of research by guarding against misconduct and against improperly representing the data.  Because we deal with people, it is our obligation to ensure that interviewees are not harmed as a result of our research (Rubin, 2012). The following anticipated risks are from Creswell (2013) and Rubin (2012):

  • Prior to conducting the study
    • We must seek Institutional Review Board (IRB) approval before we conduct a study.
    • I must gain local permission to conduct the study from the agency, organization, or corporation where the study will take place, and from the participants.
  • Beginning the study
    • We will not pressure participants to sign consent forms. To ensure high participation rates, the purpose of the study must be compelling enough that participants see it as a value-added experience for themselves and for the field of study, so that they do not want to say no.
      • We should also conduct an informal needs assessment to ensure that the participants' needs are addressed in the study, further supporting a high participation rate.
      • Still, we will tell the participants that they have the right not to sign the consent form.
  • Collecting data
    • Respect the site and keep disruption to a minimum, especially when I am conducting observations. The goal of observation in this study is not to be an active participant, but to take field notes on key interactions that occur while the participants do what they need to do.
    • Make sure that all participants in the study receive the same treatment, to avoid data quality issues during collection.
    • Be respectful and straightforward with the participants.
    • Discussing the purpose of the study and how the data will be used is key to establishing trust with the participants, and it lets them start thinking about the topic of the study. This can be accomplished by sending them an email prior to the interview stating the purpose of the study and the time we are requesting of them.
    • As we ask our interview questions, avoid leading questions. That is why questions may be asked in a particular order; in some cases, questions can build on one another.
    • Avoid sharing personal impressions. Since we know what the final questions in the interview are, we should ask them without giving any indication of what we are looking for, so that participants do not end up contaminating our data.
    • Avoid disclosing sensitive or proprietary information.
  • Analyzing data
    • Avoid disclosing only one set of results; report multiple perspectives and contrary findings.
    • Keep the privacy of the participants by ensuring that names, as well as any other identifying indicators, have been removed from the results.
    • Honor promises: if I offer a participant the chance to read and correct their interview, I should do so as soon as possible after the interview.
  • Reporting, sharing, and storing data
    • Avoid situations where there is a temptation to falsify evidence, data, findings, or conclusions. Using unbiased language appropriate for the audience helps here.
    • Avoid disclosing information that could harm the specialists.
    • Keep the data in a shareable format, with the privacy of the specialists as the main priority, and keep the raw data and other materials in a secure location for five years. Part of this data should consist of complete proof of compliance (IRB approval, lack of conflict of interest, etc.) for if and when it is requested.

References:

Observational protocol and qualitative documentations

As a researcher, you could range from a non-participant to a full-on participant when observing your subjects in a study.  The observed behaviors and activities of the individuals in the study are jotted down in field notes (Creswell, 2013).  Most researchers use an observational protocol for jotting down these notes as they observe their subjects.  According to Creswell (2013), this protocol could consist of: "separate descriptive notes (portraits of the participants, a reconstruction of dialogue, a description of the physical setting, accounts of particular events, or activities) [and] reflective notes (the researcher's personal thoughts, such as 'speculation, feelings, problems, ideas, hunches, impressions, and prejudices'), … this form might [have] demographic information about the time, place, and date of the field setting where the observation takes place."

Observational work can be combined with in-depth interviewing, and sometimes the observational work (which can cover an everyday activity) helps prepare the researcher for the interviews (Rubin, 2012).  Doing so can increase the quality of the interviews, because the interviewees know what the researcher has seen or read and can provide more information on those materials.  It also allows the researcher to master the terminology before entering the interview. Finally, Rubin (2012) states that cultural norms become more visible through observation than through a pure in-depth interview.

In Creswell (2013), qualitative documents are information contained within documents that could help a researcher in their study; they can be public (newspapers, meeting minutes, official reports) and/or private (personal journals/diaries, letters, emails, internal manuals, written procedures, etc.).  They can also include pictures, videos, educational materials, books, and files. Artifact analysis, by contrast, is the analysis of written text, usually charts, flow sheets, intake forms, reports, etc.

The main analysis approach for this document would be to read it to gain a subject-matter understanding.  Document analysis aids in quickly grouping, sorting, and re-sorting the data obtained for a study.  This manual will not be included in the coded dataset, but it will help provide appropriate codes/categories for the interview analysis; in other words, it gives me suggestions about what might be related to what.  Finally, one way to interpret this document would be through triangulation of data (data from multiple sources that are highly correlated) between the observations, the interviews, and this document.

References

Organizational research & Participant Observer

For organizational research, some of the major research goals are to examine organizations' formation, recruitment of talent, adaptation to constraints, types and causes, and factors behind growth, change, and demise, all of which fall under ethnographic studies (Lofland, 2005).  Ethnographic studies lend themselves much more nicely to participant observers.

A participant observer is a researcher/observer who is not only watching their subjects but also actively participating (joining in) with them. The level of participation might affect what is observed (the more participation, the harder it is to observe and take notes), so a low-key participating role is preferred.  Participating before the interviews allows the observer to become sensitive to important issues that would otherwise be missed. It is a more in-depth version of interviewing, building on a regular conversation.  Participation may occur after watching for a while, focusing on a specific topic or question (Rubin, 2012).

References:

Data Analysis of Qualitative data

Each of these methods has at its core a thematic analysis of the data, which methodically and categorically links data, phrases, sentences, paragraphs, etc. into particular themes.  Grouping these themes by their thematic properties helps in understanding the data and developing meaningful themes, aiding in building a conclusion to the central question.

Ethnographic content analysis (Herron, 2015): thick descriptions (a collection of field notes that describe and record learning, along with the researcher's perceptions) help in the creation of cultural themes (themes related to behaviors with an underlying action), from which the information is interpreted.

Phenomenological data analysis (Kerns, 2014): connections among different classes of data are made through a thematic analysis, from which results can be derived.

Case study analysis (Hartsock, 2014): by organizing the data within a specific case design and treating each distinct data set as a case, one can derive general themes within each individual case.  Once all these general themes are identified, we should look for cross-case themes.

Grounded theory data analysis (Falciani-White, 2013): code the data by comparing incidents/data to a category (breaking down, analyzing, comparing, labeling, and categorizing data into meaningful units), and integrate categories by their properties, in order to identify a few themes that drive a theory in a systematic manner.

References: