Introducing data munging

Raw data is typically messy and requires a series of transformations before it becomes useful for modeling and analysis work. Such Datasets can have missing values, duplicate records, corrupted data, incomplete records, and so on. In its simplest form, data munging, or data wrangling, is the transformation of raw data into a usable format. In most projects, this is the most challenging and time-consuming step.

However, without data munging, your project can be reduced to a garbage-in, garbage-out scenario.
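As a minimal sketch of what this cleanup can look like, the following plain Python example (standard library only) drops exact duplicate records and sets aside rows whose measurement is missing or corrupted. The sample data, the column names (`id`, `city`, `temp_c`), and the `clean` helper are hypothetical and chosen only for illustration; real projects usually need more nuanced rules, such as imputing missing values instead of rejecting the row.

```python
import csv
import io

# Hypothetical raw data: it contains duplicate rows, a missing value,
# and a corrupted (non-numeric) value.
RAW = """id,city,temp_c
1,Paris,12.5
2,Lyon,
2,Lyon,
3,Nice,n/a
4,Lille,9.0
1,Paris,12.5
"""


def clean(raw_text):
    """Drop exact duplicates; reject rows whose temp_c is missing or not numeric."""
    seen = set()
    cleaned, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        key = tuple(row.values())
        if key in seen:  # duplicate record, already handled once
            continue
        seen.add(key)
        try:
            row["temp_c"] = float(row["temp_c"])
        except ValueError:  # empty string or corrupted text
            rejected.append(row)
            continue
        cleaned.append(row)
    return cleaned, rejected


cleaned, rejected = clean(RAW)
print("usable rows:  ", [r["id"] for r in cleaned])
print("rejected rows:", [r["id"] for r in rejected])
```

Keeping the rejected rows, rather than silently discarding them, makes it easier to judge later how much data was lost and why.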

Typically, you will apply a series of operations such as subsetting, filtering, aggregating, sorting, merging, and reshaping. In addition, you will convert data types, add new fields or columns, and rename existing ones.
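The operations named above can be sketched on a tiny in-memory example. This is plain Python using only the standard library, not any particular data-processing framework; the records, the field names (`dt`, `site`, `kw`), and the `sites` lookup table are hypothetical.

```python
from collections import defaultdict

# Hypothetical records; every value starts out as a string, as in a raw CSV.
readings = [
    {"dt": "2016-01-01", "site": "A", "kw": "1.5"},
    {"dt": "2016-01-01", "site": "B", "kw": "0.5"},
    {"dt": "2016-01-02", "site": "A", "kw": "2.5"},
    {"dt": "2016-01-02", "site": "B", "kw": "3.0"},
]
sites = {"A": "Paris", "B": "Lyon"}  # lookup table used for the merge

# Type conversion, rename (dt -> date), and a derived column (watts).
typed = [
    {
        "date": r["dt"],
        "site": r["site"],
        "kw": float(r["kw"]),
        "watts": float(r["kw"]) * 1000,
    }
    for r in readings
]

# Filter: keep readings of at least 1 kW.
filtered = [r for r in typed if r["kw"] >= 1.0]

# Aggregate: total kW per site.
totals = defaultdict(float)
for r in filtered:
    totals[r["site"]] += r["kw"]

# Merge with the lookup table, then sort by total in descending order.
report = sorted(
    ({"site": s, "city": sites[s], "total_kw": t} for s, t in totals.items()),
    key=lambda row: row["total_kw"],
    reverse=True,
)

for row in report:
    print(row)
```

Each step produces a new collection instead of modifying the previous one, which keeps the intermediate results available for inspection while a pipeline is being developed.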

A large project can comprise several different kinds of data with varying degrees of data quality. There can be a mix of numerical, textual, time-series, structured, and unstructured data, including audio and video data, used together or separately for analysis. A substantial part of such projects consists of cleansing and transformation steps combined with some statistical analyses and visualization.

We will use several Datasets to demonstrate the key data munging techniques required for preparing the data for subsequent modeling and analyses. These Datasets and their sources are listed as follows:

Individual household electric power consumption Dataset: This Dataset was originally provided by Georges Hebrail, Senior Researcher, EDF R&D, Clamart, France and Alice Berard, TELECOM ParisTech Master of Engineering Internship at EDF R&D, Clamart, France. The Dataset consists of measurements of electric power consumption in one household at one-minute intervals for a period of nearly four years. This Dataset is available for download from the UCI Machine Learning Repository at the following URL:

https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption.

Machine Learning based ZZAlpha Ltd Stock Recommendations 2012-2014 Dataset: This Dataset contains recommendations made for various US-traded stock portfolios on the morning of each day during a three-year period from Jan 1, 2012 to Dec 31, 2014. This Dataset can be downloaded from the following URL:

https://archive.ics.uci.edu/ml/datasets/Machine+Learning+based+ZZAlpha+Ltd.+Stock+Recommendations+2012-2014.

Paris weather history Dataset: This Dataset contains the daily weather report for Paris. We downloaded historical data covering the same time period as the household electric power consumption Dataset. This Dataset can be downloaded from the following URL:

https://www.wunderground.com/history/airport/LFPG.

Original 20 newsgroups data: This Dataset consists of 20,000 messages taken from 20 Usenet newsgroups. The original owner and donor of this Dataset was Tom Mitchell, School of Computer Science, Carnegie Mellon University. Approximately a thousand Usenet articles were taken from each of the 20 newsgroups. Each newsgroup is stored in a subdirectory, and each article is stored as a separate file. The Dataset can be downloaded from the following URL:

http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html.

Yahoo finance data: This Dataset comprises historical daily stock prices for six stocks over a one-year period (from 12/04/2015 to 12/04/2016). The data for each of the chosen ticker symbols can be downloaded from the following site:

http://finance.yahoo.com/.