Data Tools: XML

Large datasets, including application data and logs, are often represented in XML, which uses markup tags to break data into labeled sections.

What is XML and how is it used to represent data

XML, also known as the eXtensible Markup Language, is a standardized way of referring to and identifying objects or data items by type(s) in a flexible, hierarchical approach (Brookshear & Brylow, 2014; Lublinsky, Smith, & Yakubovich, 2013; McNurlin, Sprague, & Bui, 2008; Sadalage & Fowler, 2012).  XML identifies objects by type when it assigns tags to certain parts of the data, thereby defining the data (McNurlin et al., 2008).  JSON provides similar functionality to XML, but XML's schema and query capabilities are better than JSON's (Sadalage & Fowler, 2012). XML focuses more on semantics than appearance, which allows for searches that understand the contents of the data being considered. It is therefore considered a standard for producing markup languages, such as XHTML, the XML-based version of HTML (Brookshear & Brylow, 2014).  Finally, XML uses object principles: a document tells you what data is needed to perform a function and what output the function can give you, but not how the function will do it (McNurlin et al., 2008).

XML documents contain descriptions of a service or function, how to request the service or function, the data it needs to perform its work, and the results it will deliver (McNurlin et al., 2008). Also, relational databases have taken on XML as a structuring mechanism by accepting the XML document as a column type, which allows for the creation of XML querying languages (Sadalage & Fowler, 2012).  Therefore, XML essentially helps in defining the data and giving it meaning that computers can manipulate and work on, transforming data from a human-readable format into a computer-readable one. This can help with applications such as currency conversion, credit card processing, etc. (McNurlin et al., 2008).
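
As a rough illustration of what querying XML can look like, the sketch below uses the limited XPath subset built into Python's standard-library `xml.etree.ElementTree` module. It is only a stand-in for the idea of querying by tag and attribute, not the query language of any particular relational database, and the customer records are hypothetical data invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical customer data, invented purely for illustration.
customers_xml = """
<customers>
    <customer country="U.S."><name>Ada</name><phone>1234567890</phone></customer>
    <customer country="U.K."><name>Alan</name><phone>2079460000</phone></customer>
    <customer country="U.S."><name>Grace</name><phone>9876543210</phone></customer>
</customers>
"""

root = ET.fromstring(customers_xml.strip())

# Select the <phone> child of every <customer> whose country attribute is "U.S."
us_phones = [
    phone.text
    for phone in root.findall("./customer[@country='U.S.']/phone")
]

print(us_phones)
```

Because the tags and attributes describe what each value means, the query can ask for "phone numbers of U.S. customers" rather than for a position in a line of text.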

XML can handle a high volume of data and can represent all varieties of data (structured, unstructured, and semi-structured) in an active or live (streaming) fashion. Examples include CAD data used to design buildings, multimedia, music, product bar code scans, photographs of damaged property held by insurance companies, etc. (Brookshear & Brylow, 2014; Lublinsky et al., 2013; McNurlin et al., 2008; Sadalage & Fowler, 2012). By definition, big data is any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013).  Therefore, XML can represent big data quite nicely.

Use of XML to represent data in various forms

XML documents represent a whole data file, which contains markups, elements, and nodes (Lublinsky et al., 2013; Myer, 2005):

  • XML markups are tags that help describe the data's start and end points as well as the data's properties/attributes; each tag is enclosed by a < and a >
  • XML elements are data values, encapsulated by an opening <tag> and a closing </tag>
  • XML nodes are part of the hierarchical structure of a document that contains a data element and its tags

Data can comprise text and numbers, like the telephone number (123)-456-7890, which can be represented in XML as <phone country="U.S.">1234567890</phone>.  The country attribute adds context to the object, defining it further as a U.S. phone number (Myer, 2005).  Given XML's hierarchical nature, the root data element helps sort all of the data below it, similar to a hierarchical data tree (Lublinsky et al., 2013; Myer, 2005).
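
A minimal sketch of how the markup, element, and attribute in the phone example could be read back by a program, assuming Python and its standard-library `xml.etree.ElementTree` parser (a choice made for this illustration, not something the cited sources prescribe):

```python
import xml.etree.ElementTree as ET

# The phone example, written with straight quotes so that it is well-formed XML.
doc = '<phone country="U.S.">1234567890</phone>'

phone = ET.fromstring(doc)

print(phone.tag)             # the tag name that labels the data
print(phone.attrib)          # all attributes of the element, as a dictionary
print(phone.get("country"))  # the value of one attribute
print(phone.text)            # the data value held between the tags
```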

Elements are described by tags, which add context to the data and turn it into information. Adding hierarchical structure to these informational elements describes their natural relationships, which aids in transforming information into knowledge (Myer, 2005). This can aid in analyzing big data sets.

Myer (2005) provided the following example of XML syntax, which also showcases the XML representation of data in a hierarchical structure:

<Actor type="superstar">
                <name>Harrison Ford</name>
                <gender>male</gender>
                <age>50</age>
</Actor>
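
To show how a program might walk this hierarchy, here is a small sketch in Python using the standard-library `xml.etree.ElementTree` module. The helper `to_record` is a hypothetical name introduced for this illustration, and the document string is a well-formed copy of the Actor example (straight quotes and matching closing tags).

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the Actor example.
actor_xml = """
<Actor type="superstar">
    <name>Harrison Ford</name>
    <gender>male</gender>
    <age>50</age>
</Actor>
"""


def to_record(element):
    """Flatten one element into a dict: its attributes plus its child values."""
    record = dict(element.attrib)
    for child in element:
        # child.text is None for an empty element, so fall back to "".
        record[child.tag] = (child.text or "").strip()
    return record


root = ET.fromstring(actor_xml.strip())
record = to_record(root)

# XML stores every value as text; convert where a number is expected.
record["age"] = int(record["age"])

print(root.tag, record)
```

The root element (Actor) is the parent node, and each child element becomes one named field of the record.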

Given this simple structure, XML can easily be created by humans or by code and saved as a document. However, to derive the value that Myer (2005) described from XML-formatted data, the data will need to be ingested into Hadoop for analysis (Agrawal, 2014).  Finally, XML can deal with numerical data such as integers, reals, floats, longs, doubles, NaN, INF, -INF, probabilities, and percentages, as well as string data, plain arrays of values (numerical and string arrays), sparse arrays, matrices, sparse matrices, etc. (Data Mining Group, n.d.).  Thus, XML can address the various types of data mentioned above, at high volumes.
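
As a sketch of creating such a document by code, the example below builds the Actor record with Python's standard-library `xml.etree.ElementTree` module, serializes it to text, and reads a numeric value back. The function name `build_actor` and the string of sample values are hypothetical, introduced only for this illustration; the last step shows that tokens such as NaN, INF, and -INF can be converted to floating-point values by Python's `float`.

```python
import math
import xml.etree.ElementTree as ET


def build_actor(name, gender, age, actor_type):
    """Create an <Actor> element with three child elements."""
    actor = ET.Element("Actor", {"type": actor_type})
    ET.SubElement(actor, "name").text = name
    ET.SubElement(actor, "gender").text = gender
    ET.SubElement(actor, "age").text = str(age)  # XML content is always text
    return actor


actor = build_actor("Harrison Ford", "male", 50, "superstar")

# Serialize the element tree to a string that could be written to a file.
xml_text = ET.tostring(actor, encoding="unicode")
print(xml_text)

# Read the document back and convert the age to an integer.
parsed = ET.fromstring(xml_text)
age = int(parsed.findtext("age"))
print(age + 1)

# Hypothetical whitespace-separated values, including special float tokens.
values = [float(token) for token in "1.5 NaN INF -INF".split()]
print(values[0], math.isnan(values[1]), math.isinf(values[2]), values[3] < 0)
```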

Resources

  • Agrawal, V. (2014). Processing XML data in BigInsights 3.0. Retrieved from https://developer.ibm.com/hadoop/2014/10/31/processing-xml-data-biginsights-3-0/
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview (12th ed.). Pearson Learning Solutions. VitalBook file.
  • Data Mining Group (n.d.). PMML 4.1 – General structure. Retrieved from http://dmg.org/pmml/v4-1/GeneralStructure.html
  • Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox. VitalBook file.
  • McNurlin, B., Sprague, R., & Bui, T. (2008). Information Systems Management (8th ed.). Pearson Learning Solutions. VitalBook file.
  • Myer, T. (2005). A really, really, really good introduction to XML. Retrieved from https://www.sitepoint.com/really-good-introduction-xml/
  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Pearson Learning Solutions. VitalBook file.
