josephhu/GetdataProject

README for Human Activity Recognition (HAR) Tidy Data Set

This repository contains a tidy data set and the R script used to generate it.

Deliverables

The following items are provided:

  1. The Raw Data.
  2. A Tidy Data Set
  3. A Code Book describing each variable and its values in the tidy data set.
  4. An explicit and exact Recipe used to generate 2 and 3 from the original raw data.

The Raw Data

A full description is available at the site where the data was obtained.

The zipped data file is at https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
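As a minimal sketch, the archive could be fetched and unpacked from R roughly as follows. The function name fetch.raw.data() and its arguments are hypothetical and chosen for illustration; they are not the download.data() and unzip.data() functions from run_analysis.R, whose details may differ.

```r
# URL of the zipped raw data, as given above.
raw.data.url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"

# Hypothetical helper: download the zip file (if not already present)
# into dest.dir and unpack it there. Returns dest.dir invisibly.
fetch.raw.data <- function(dest.dir = ".", zip.name = "UCI_HAR_Dataset.zip") {
  zip.path <- file.path(dest.dir, zip.name)
  if (!file.exists(zip.path)) {
    # mode = "wb" keeps the binary archive intact on Windows.
    download.file(raw.data.url, destfile = zip.path, mode = "wb")
  }
  unzip(zip.path, exdir = dest.dir)
  invisible(dest.dir)
}
```

Calling fetch.raw.data() would download into the current working directory; the call is left out here so that the sketch has no side effects when sourced.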

The original experiments were carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, the experimenters captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments were video-recorded so that the data could be labeled manually. The resulting dataset was randomly partitioned into two sets: 70% of the volunteers were selected for generating the training data and 30% for the test data.

The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain. See 'features_info.txt' for more details.

The raw data separate the 30 subjects into a training set and a test set.

Within each set, the data are further separated into:

  1. subject_train/test.txt, which holds only the subject IDs (from 1 to 30)
  2. y_train/test.txt, which holds the activity values (from 1 to 6)
  3. X_train/test.txt, which holds 561 columns of measurements.

activity_labels.txt contains the names of the 6 activities. features.txt contains the names of the 561 measurements.
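The file layout above suggests how the pieces fit back together: bind the three files of each set column-wise, stack the two sets row-wise, and take the column names from features.txt. The sketch below illustrates this under two assumptions: the files of each set sit in "train" and "test" subdirectories of the unzipped data directory, and features.txt has the feature name in its second column. The helper names read.partition() and merge.partitions() are hypothetical and are not the script's own merge.data().

```r
# Hypothetical helper: read one set ("train" or "test") and bind
# subject, activity, and measurements column-wise.
read.partition <- function(data.dir, set) {
  path <- function(prefix) {
    file.path(data.dir, set, paste0(prefix, "_", set, ".txt"))
  }
  subject <- read.table(path("subject"), col.names = "subject")
  activity <- read.table(path("y"), col.names = "activity")
  measurements <- read.table(path("X"))
  cbind(subject, activity, measurements)
}

# Hypothetical helper: stack both sets and name the measurement
# columns after the entries of features.txt.
merge.partitions <- function(data.dir) {
  features <- read.table(file.path(data.dir, "features.txt"),
                         stringsAsFactors = FALSE)
  merged <- rbind(read.partition(data.dir, "train"),
                  read.partition(data.dir, "test"))
  names(merged) <- c("subject", "activity", features[[2]])
  merged
}
```

Assigning with names(merged) <- keeps the original feature names verbatim, including characters such as "-" and "()" that R would otherwise alter.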

The Tidy Data Set

The Tidy Data Set was generated by computing the average, for each subject and each activity, of the measurements that are the mean and standard deviation of the original signals.

We treat a column as a "mean" measurement if its name contains "-mean()" and as a standard deviation measurement if its name contains "-std()". Note that the original features_info.txt describes "meanFreq()" as "Weighted average of the frequency components to obtain a mean frequency", so based on this information we did not include the "-meanFreq()" columns.
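The selection rule can be illustrated with a literal (non-regex) match on the feature names. This is only a sketch on a handful of sample names, and the variable names are illustrative; the script's extract.mean.std() may implement the rule differently.

```r
# A few sample feature names in the style of features.txt.
feature.names <- c("tBodyAcc-mean()-X",
                   "tBodyAcc-std()-X",
                   "fBodyAcc-meanFreq()-X",
                   "tBodyAccMag-mean()",
                   "angle(tBodyAccMean,gravity)")

# fixed = TRUE matches the pattern literally, so the parentheses are
# not treated as regex grouping and "-meanFreq()" does not match.
is.mean.or.std <- grepl("-mean()", feature.names, fixed = TRUE) |
  grepl("-std()", feature.names, fixed = TRUE)

selected <- feature.names[is.mean.or.std]
print(selected)
```

On these sample names the match keeps "tBodyAcc-mean()-X", "tBodyAcc-std()-X", and "tBodyAccMag-mean()", and drops the "-meanFreq()" and angle() entries.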

Note the tidy data set file was created in R by: write.table(data.frame, "UCI_HAR_TidyData.txt", row.names=FALSE)

So the original data frame can be recreated in R as: data.frame <- read.table("UCI_HAR_TidyData.txt", header=TRUE)
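A quick way to convince yourself that the round trip preserves the data is to try it on a small stand-in data frame. The values and the temporary file below are made up for illustration.

```r
# Toy stand-in for the tidy data set (values are invented).
toy <- data.frame(subject = c(1, 2),
                  activity = c("WALKING", "SITTING"),
                  averageOf.tBodyAccMeanX = c(0.27, 0.26),
                  stringsAsFactors = FALSE)

# Write and read back with the same arguments as described above.
tidy.file <- tempfile(fileext = ".txt")
write.table(toy, tidy.file, row.names = FALSE)
restored <- read.table(tidy.file, header = TRUE, stringsAsFactors = FALSE)

# Column names and values should survive the round trip.
print(identical(names(restored), names(toy)))
print(all.equal(restored$averageOf.tBodyAccMeanX, toy$averageOf.tBodyAccMeanX))
```

The column names survive unchanged because names such as "averageOf.tBodyAccMeanX" are already syntactically valid in R, so read.table() has no reason to alter them.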

The Code Book

The Code Book describes each variable, its values and units.

As in the original raw data set, a features.txt file lists all the columns of the tidy data set.

The Recipe

The R script "run_analysis.R" contains all the R functions used to generate the Tidy Data Set. It should be executed in R by source("run_analysis.R") and then run()

  • run() is the main routine
  • download.data() downloads the original zip file to the working directory
  • unzip.data() unzips the zip file
  • merge.data() creates a big data frame of subject, activity, and 561 measurements
  • extract.mean.std() extracts only subject, activity, and the measurements whose names contain "-mean()" or "-std()"
  • gen.tidy.data() computes the average (per subject and activity) of the "-mean()" and "-std()" values
  • rename.columns() renames the tidy data set with descriptive variable names.

Detailed descriptions for each function are in the R script comments.

gen.tidy.data() implements three alternative approaches to calculating the average of each measurement:

  • melt()/dcast() from the reshape2 library
  • ddply() with colwise from the plyr library
  • aggregate() from base R
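The three approaches can be compared side by side on a toy data frame. This is a sketch under assumptions: the data and variable names are invented, the plyr and reshape2 variants run only if those packages are installed, and numcolwise() (the numeric-only variant of colwise()) is used here so that the character activity column is not averaged. The actual calls in run_analysis.R may differ.

```r
# Toy data: two subjects, repeated observations per activity.
toy <- data.frame(subject = c(1, 1, 1, 2, 2),
                  activity = c("WALKING", "WALKING", "SITTING",
                               "WALKING", "WALKING"),
                  tBodyAccMeanX = c(0.2, 0.4, 0.1, 0.3, 0.5),
                  tBodyAccStdX = c(-0.9, -0.7, -0.8, -0.6, -0.4),
                  stringsAsFactors = FALSE)

# Base R: average every other column per subject and activity.
by.aggregate <- aggregate(. ~ subject + activity, data = toy, FUN = mean)

# plyr: split by the grouping columns, average the numeric columns.
if (requireNamespace("plyr", quietly = TRUE)) {
  by.ddply <- plyr::ddply(toy, c("subject", "activity"),
                          plyr::numcolwise(mean))
}

# reshape2: melt to long form, then cast back while averaging.
if (requireNamespace("reshape2", quietly = TRUE)) {
  molten <- reshape2::melt(toy, id.vars = c("subject", "activity"))
  by.dcast <- reshape2::dcast(molten, subject + activity ~ variable,
                              fun.aggregate = mean)
}

print(by.aggregate)
```

All three produce one row per subject and activity combination that occurs in the data (three rows for this toy input), though row and column order can differ between them.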

The project instructions call for "descriptive activity names to name the activities in the data set", so the activity values of 1, 2, etc. are converted to the strings WALKING, WALKING_UPSTAIRS, etc.
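One way to do this conversion is with factor(), sketched below. The lookup table is built inline and assumes the activity values 1 to 6 map to the names in the order listed earlier in this README; in practice the table would be read from activity_labels.txt. The variable names are illustrative.

```r
# Assumed lookup table (normally read from activity_labels.txt).
activity.labels <- data.frame(
  id = 1:6,
  name = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
           "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Sample activity values as they appear in y_train/test.txt.
activity.codes <- c(1, 5, 6, 2)

# Map each numeric value to its descriptive name.
activity.names <- factor(activity.codes,
                         levels = activity.labels$id,
                         labels = activity.labels$name)
print(as.character(activity.names))
```

With the assumed table, the sample codes 1, 5, 6, 2 become WALKING, STANDING, LAYING, WALKING_UPSTAIRS.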

Because of the length of the original names, we chose not to convert everything to lower case; instead we use camel case to make the names easier to read. The original data set also used some sensible abbreviations (such as Gyro for Gyroscope), so we decided to keep them.

The original feature name "tBodyAcc-mean()-X" is thus converted to "tBodyAccMeanX"

To emphasize that the tidy data set holds summaries (averages) of the original data, each column is further renamed with an "averageOf." prefix, for example "averageOf.tBodyAccMeanX".
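The two renaming steps can be sketched as a small function. The name descriptive.name() is hypothetical, and the "Std" spelling for "-std()" columns is an assumption made by analogy with the "Mean" example above; rename.columns() in the script may handle these details differently.

```r
# Hypothetical helper: turn an original feature name into the
# camel-case, prefixed form used in the tidy data set.
descriptive.name <- function(feature.name) {
  # Literal replacements, so "()" is not treated as regex syntax.
  name <- gsub("-mean()", "Mean", feature.name, fixed = TRUE)
  name <- gsub("-std()", "Std", name, fixed = TRUE)  # assumed spelling
  # Drop the remaining dash before the axis letter, if any.
  name <- gsub("-", "", name, fixed = TRUE)
  paste0("averageOf.", name)
}

print(descriptive.name("tBodyAcc-mean()-X"))
```

For the example from the text, descriptive.name("tBodyAcc-mean()-X") gives "averageOf.tBodyAccMeanX". Because gsub() and paste0() are vectorized, the same function can be applied to a whole vector of column names at once.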
