diff --git a/1_DATASCITOOLBOX/Data Scientists Toolbox Course Notes.Rmd b/1_DATASCITOOLBOX/Data Scientists Toolbox Course Notes.Rmd
index 8370812..22e87ba 100644
--- a/1_DATASCITOOLBOX/Data Scientists Toolbox Course Notes.Rmd
+++ b/1_DATASCITOOLBOX/Data Scientists Toolbox Course Notes.Rmd
@@ -20,13 +20,14 @@ $\pagebreak$
 * `pwd` = print working directory (current directory)
 * `clear` = clear screen
 * `ls` = list stuff
-    * `-a` = see all (hidden)
+    * `-a` = see all (including hidden files)
     * `-l` = details
 * `cd` = change directory
 * `mkdir` = make directory
 * `touch` = creates an empty file
 * `cp` = copy
     * `cp ` = copy a file to a directory
+    * `cp ` = copy a file under a new name
     * `cp -r ` = copy all documents from directory to new Directory
         * `-r` = recursive
 * `rm` = remove
@@ -102,7 +103,7 @@ $\pagebreak$
 * **Big data** = now possible to collect data cheap, but not necessarily all useful (need the right data)
 
 ## Experimental Design
-* Formulate you question in advance
+* Formulate your question in advance
 * **Statistical inference** = select subset, run experiment, calculate descriptive statistics, use inferential statistics to determine if results can be applied broadly
 * ***[Inference]*** **Variability** = lower variability + clearer differences = decision
 * ***[Inference]*** **Confounding** = underlying variable might be causing the correlation (sometimes called Spurious correlation)
@@ -118,5 +119,5 @@ $\pagebreak$
 * **Accuracy** = Pr(correct outcome)
 * **Data dredging** = use data to fit hypothesis
 * **Good experiments** = have replication, measure variability, generalize problem, transparent
-* Prediction is not inference, and be ware of data dredging
+* Prediction is not inference, and beware of data dredging
 
diff --git a/2_RPROG/R Programming Course Notes.Rmd b/2_RPROG/R Programming Course Notes.Rmd
index 9a7ba87..cb26a26 100644
--- a/2_RPROG/R Programming Course Notes.Rmd
+++ b/2_RPROG/R Programming Course Notes.Rmd
@@ -2,13 +2,13 @@
 title: "R Programming Course Notes"
 author: "Xing Su"
 output:
-  pdf_document:
-    toc: yes
-    toc_depth: 3
   html_document:
     highlight: pygments
     theme: spacelab
     toc: yes
+  pdf_document:
+    toc: yes
+    toc_depth: 3
 ---
 
 $\pagebreak$
@@ -360,7 +360,7 @@ $\pagebreak$
 * ***examples***
     * `apply(x, 1, sum)` or `apply(x, 1, mean)` = find row sums/means
    * `apply(x, 2, sum)` or `apply(x, 2, mean)` = find column sums/means
-    * `apply(x, 1, quantile, props = c(0.25, 0.75))` = find 25% 75% percentile of each row
+    * `apply(x, 1, quantile, probs = c(0.25, 0.75))` = find the 25th and 75th percentiles of each row
     * `a <- array(rnorm(2*2*10), c(2, 2, 10))` = create 10 2x2 matrix
     * `apply(a, c(1, 2), mean)` = returns the means of 10
 
@@ -551,7 +551,7 @@ $\pagebreak$
 ### Larger Tables
 * ***Note**: help page for read.table important*
 * need to know how much RAM is required $\rightarrow$ calculating memory requirements
-    * `numRow` x `numCol` x 8 bytes/numeric value = size required in bites
+    * `numRow` x `numCol` x 8 bytes/numeric value = size required in bytes
     * double the above results and convert into GB = amount of memory recommended
 * set `comment.char = ""` to save time if there are no comments in the file
 * specifying `colClasses` can make reading data much faster
diff --git a/3_GETDATA/Getting and Cleaning Data Course Notes.Rmd b/3_GETDATA/Getting and Cleaning Data Course Notes.Rmd
index ad298f3..4f342da 100644
--- a/3_GETDATA/Getting and Cleaning Data Course Notes.Rmd
+++ b/3_GETDATA/Getting and Cleaning Data Course Notes.Rmd
@@ -63,7 +63,7 @@ $\pagebreak$
     * ***Relative***: `setwd("./data")`, `setwd("../")` = move up in directory
     * ***Absolute***: `setwd("/User/Name/data")`
 * **Check if file exists and download file**
-    * `if(!file.exists("data"){dir.create("data")}`
+    * `if(!file.exists("./data")) {dir.create("./data")}`
 * **Download file**
     * `download.file(url, destfile= "directory/filname.extension", method = "curl")`
         * `method = "curl"` [mac only for https]
@@ -120,7 +120,7 @@
     * `xpathSApply(rootNode, "//price", xmlValue)` = get the values of all elements with tag "price"
 * **extract content by attributes**
     * `doc <- htmlTreeParse(url, useInternal = True)`
-    * `scores <- xpathSApply(doc, "//li@class='score'", xmlvalue)` = look for li elements with `class = "score"` and return their value
+    * `scores <- xpathSApply(doc, "//li[@class='score']", xmlValue)` = look for li elements with `class = "score"` and return their value
 
 
 
@@ -153,14 +153,14 @@ $\pagebreak$
 ## data.table
 * inherits from `data.frame` (external package) $\rightarrow$ all functions that accept `data.frame` work on `data.table`
 * can be much faster (written in C), ***much much faster*** at subsetting/grouping/updating
-* **syntax**: `dt <- data.table(x = rnorm(9), y = rep(c(a, b, c), each = 3), z = rnorm(9)`
+* **syntax**: `dt <- data.table(x = rnorm(9), y = rep(c("a","b","c"), each = 3), z = rnorm(9))`
 * `tables()` = returns all data tables in memory
     * shows name, nrow, MB, cols, key
 * some subset works like before = `dt[2, ], dt[dt$y=="a",]`
 * `dt[c(2, 3)]` = subset by rows, rows 2 and 3 in this case
 * **column subsetting** (modified for `data.table`)
     * argument after comma is called an ***expression*** (collection of statements enclosed in `{}`)
-    * `dt[, list(means(x), sum(z)]` = returns mean of x column and sum of z column (no `""` needed to specify column names, x and z in example)
+    * `dt[, list(mean(x), sum(z))]` = returns mean of x column and sum of z column (no `""` needed to specify column names, x and z in example)
     * `dt[, table(y)]` = get table of y value (perform any functions)
 * **add new columns**
     * `dt[, w:=z^2]`
@@ -176,9 +176,9 @@
 * **special variables**
     * `.N` = returns integer, length 1, containing the number (essentially count)
         * `dt <- data.table (x=sample(letters[1:3], 1E5, TRUE))` = generates data table
-        * `dt[, .N by =x]` = creates a table to count observations by the value of x
+        * `dt[, .N, by = x]` = creates a table to count observations by the value of x
 * **keys** (quickly filter/subset)
-    * *example*: `dt <- data.table(x = rep(c("a", "b", "c"), each 100), y = rnorm(300))` = generates data table
+    * *example*: `dt <- data.table(x = rep(c("a", "b", "c"), each = 100), y = rnorm(300))` = generates data table
     * `setkey(dt, x)` = set the key to the x column
     * `dt['a']` = returns a data frame, where x = 'a' (effectively filter)
 * **joins** (merging tables)
@@ -187,9 +187,9 @@
     * `setkey(dt1, x); setkey(dt2, x)` = sets the keys for both data tables to be column x
     * `merge(dt1, dt2)` = returns a table, combine the two tables using column x, filtering to only the values that match up between common elements the two x columns (i.e. 'a') and the data is merged together
 * **fast reading of files**
-    * *example*: `big_df <- data.frame(norm(1e6), norm(1e6))` = generates data table
+    * *example*: `big_df <- data.frame(rnorm(1e6), rnorm(1e6))` = generates data table
     * `file <- tempfile()` = generates empty temp file
-    * `write.table(big.df, file=file, row.names=FALSE, col.names = TRUE, sep = "\t". quote = FALSE)` = writes the generated data from big.df to the empty temp file
+    * `write.table(big_df, file=file, row.names=FALSE, col.names = TRUE, sep = "\t", quote = FALSE)` = writes the generated data from big_df to the empty temp file
     * `fread(file)` = read file and load data = much faster than `read.table()`
 
 
@@ -202,7 +202,7 @@
 * free/widely used open sources database software, widely used for Internet base applications
 * each row = record
 * data are structured in databases $\rightarrow$ series tables (dataset) $\rightarrow$ fields (columns in dataset)
-* `dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu)` = open a connection to the database
+* `dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu")` = open a connection to the database
     * `db = "hg19"` = select specific database
     * `MySQL()` can be replaced with other arguments to use other data structures
 * `dbGetQuery(db, "show databases;")` = return the result from the specified SQL query executed through the connection
@@ -473,7 +473,7 @@ $\pagebreak$
 ## Subsetting and Sorting
 * **subsetting**
     * `x <- data.frame("var1" = sample(1:5), "var2" = sample(6:10), "var3" = (11:15))` = initiates a data frame with three names columns
-    * `x <- x[sample(1:5)` = this scrambles the rows
+    * `x <- x[sample(1:5), ]` = this scrambles the rows
     * `x$var2[c(2,3)] = NA` = setting the 2nd and 3rd element of the second column to NA
     * `x[1:2, "var2"]` = subsetting the first two row of the the second column
     * `x[(x$var1 <= 3 | x$var3 > 15), ]` = return all rows of x where the first column is less than or equal to three or where the third column is bigger than 15
diff --git a/7_REGMODS/Regression Models Course Notes.Rmd b/7_REGMODS/Regression Models Course Notes.Rmd
index 16539ed..299954e 100644
--- a/7_REGMODS/Regression Models Course Notes.Rmd
+++ b/7_REGMODS/Regression Models Course Notes.Rmd
@@ -743,13 +743,14 @@ $\pagebreak$
 ### Intervals/Tests for Coefficients
 * standard errors for coefficients
 $$\begin{aligned}
-Var(\hat \beta_1) & = Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{((X_i - \bar X)^2)}\right) \\
-(expanding) & = Var\left(\frac{\sum_{i=1}^n Y_i (X_i - \bar X) - \bar Y \sum_{i=1}^n (X_i - \bar X)}{((X_i - \bar X)^2)}\right) \\
-& Since~ \sum_{i=1}^n X_i - \bar X = 0 \\
-(simplifying) & = \frac{\sum_{i=1}^n Y_i (X_i - \bar X)}{(\sum_{i=1}^n (X_i - \bar X)^2)^2} \Leftarrow \mbox{denominator taken out of } Var\\
+Var(\hat \beta_1) & = Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\
+(expanding) & = Var\left(\frac{\sum_{i=1}^n Y_i (X_i - \bar X) - \bar Y \sum_{i=1}^n (X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\
+& Since~ \sum_{i=1}^n (X_i - \bar X) = 0 \\
+(simplifying) & = \frac{Var\left(\sum_{i=1}^n Y_i (X_i - \bar X)\right)}{(\sum_{i=1}^n (X_i - \bar X)^2)^2} \Leftarrow \mbox{denominator taken out of } Var\\
+& Since~ Var\left(\sum aY\right) = \sum a^2 Var\left(Y\right) \\
 (Var(Y_i) = \sigma^2) & = \frac{\sigma^2 \sum_{i=1}^n (X_i - \bar X)^2}{(\sum_{i=1}^n (X_i - \bar X)^2)^2} \\
 \sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) &= \frac{\sigma^2 }{ \sum_{i=1}^n (X_i - \bar X)^2 }\\
-\Rightarrow \sigma_{\hat \beta_1} &= \frac{\sigma}{ \sum_{i=1}^n X_i - \bar X} \\
+\Rightarrow \sigma_{\hat \beta_1} &= \frac{\sigma}{ \sqrt {\sum_{i=1}^n (X_i - \bar X)^2}} \\
 \\
 \mbox{by the same derivation} \Rightarrow & \\
 \sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) & = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }\right)\sigma^2 \\