What are the types of questions data science can answer?
We need well-formed questions.
Example: "How many students have a GPA below 2.4?"
Who counts as a student: CMPT majors? Minors? Do part-time students count? And which GPA: cumulative, or upper-division only?
Data in Python Notes
NumPy objects are faster than regular Python objects because they are strongly typed and stored in C-style contiguous arrays.
np.vectorize
wraps a plain Python function so it can be applied element-wise to a whole array, like a ufunc.
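A minimal sketch of how np.vectorize might be used (the clamp function is just a made-up example):

import numpy as np

def clamp(x):
    # plain Python function: limit a value to the range [0, 1]
    return min(max(x, 0.0), 1.0)

clamp_v = np.vectorize(clamp)                 # now callable on whole arrays
print(clamp_v(np.array([-0.5, 0.3, 1.7])))    # [0.  0.3 1. ]

Note that np.vectorize is mostly a convenience: the Python function is still called once per element, so it is not as fast as a true compiled ufunc.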
Getting Data
Data formats include CSV, JSON, XML.
Data could come from web APIs, databases, or files.
Extract-Transform-Load
Extract
- take data from the source and load it into your workspace
Transform
- fixing, cleaning, de-identification, etc.: the janitorial work.
Load
- save the result for the next stage of the pipeline.
Noise Filtering / Signal Processing
Part of cleaning; in reality our data can have noise covering up the truth.
Locally Weighted Scatterplot Smoothing (LOESS Smoothing)
- Given a set of data
- Take a local fraction of the data
- Fit a line through this fraction of the data, using least-squares regression (or another technique)
- Use that line as the part of the curve at the middle of the neighbourhood
- Continue with the next fraction by sliding the window along, generating a curve
LOESS is computationally heavy because it constantly re-fits as the window slides. It gives accurate results when you have lots of data.
For LOESS, a smaller fraction means a smaller set of neighbours, so the fit is more sensitive to noise in that region;
a larger fraction means the curve won't respond to signal changes as well. (A sketch in code follows.)
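A minimal sketch using the lowess implementation in statsmodels (the noisy sine data and frac=0.25 are just assumptions for illustration):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

x = np.linspace(0, 10, 200)
y = np.sin(x) + np.random.normal(0, 0.3, x.shape)   # signal + noise

# frac is the fraction of the data used for each local fit
smoothed = lowess(y, x, frac=0.25)   # returns an array of (x, smoothed y) pairs
x_smooth, y_smooth = smoothed[:, 0], smoothed[:, 1]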
What is a covariance matrix
Covariance measures how related x is to y. Positive covariance means that when x moves up, y tends to move up; negative covariance means that when x moves up, y tends to move down; and 0 means they are not (linearly) related at all.
The covariance matrix just takes this idea and extends it over multiple variables. The matrix is symmetric, meaning that when you transpose it, you get the same values.
Note that unlike correlation, covariance is not standardized to between -1 and 1; it is in the original units.
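A minimal sketch of computing a covariance matrix with NumPy (the random data is just an assumption):

import numpy as np

# 100 samples of 3 variables, one sample per row
data = np.random.normal(size=(100, 3))

cov = np.cov(data, rowvar=False)   # 3x3 symmetric covariance matrix
print(cov)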
Kalman Filtering
The filter works with two things: our observations (given by the sensor, AKA the observation matrix) and our predictions (given by our understanding of the system, AKA the transition matrix) of what we expect to happen (the prior). Both are assumed to be normally distributed.
The observation covariance, R, says how error-prone we think our observations are.
The lower the value, the less sensor error is assumed, so observations have more effect on the result and more noise comes through.
The higher the value, the less the observations are trusted, so less noise comes through.
The transition covariance, Q, says how error-prone we think our predictions are.
The lower the value, the less prediction error is assumed, so predictions have more effect on the result, giving less noise.
The higher the value, the more noise comes through.
Using the exercise on GPS data as an example: we know the person can walk at most 1 meter per second, and we know the GPS sensor records a data point every 5 seconds.
This means that if the difference between two consecutive points is more than 5 meters, something is wrong with the data.
In this example, the sensor provides the observations, and our knowledge of walking speed provides the prediction.
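A minimal sketch of smoothing a 1-D noisy signal with pykalman (the covariance values are guesses for illustration, not tuned values from the exercise):

import numpy as np
from pykalman import KalmanFilter

observations = np.random.normal(loc=10.0, scale=2.0, size=50)   # fake noisy sensor readings

kf = KalmanFilter(
    initial_state_mean=observations[0],
    initial_state_covariance=1.0,
    observation_covariance=2.0 ** 2,   # R: how noisy we think the sensor is
    transition_covariance=0.1,         # Q: how noisy we think our prediction is
    transition_matrices=[[1.0]],       # we predict "next value = current value"
)
smoothed_means, _ = kf.smooth(observations)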
Other filtering algorithms
Low-pass or high-pass: keep the low or high frequencies and discard the opposite end.
Butterworth: not gone over in detail.
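A minimal sketch of a Butterworth low-pass filter with scipy.signal (the order and cutoff are just assumptions):

import numpy as np
from scipy import signal

t = np.linspace(0, 1, 500)
noisy = np.sin(2 * np.pi * 3 * t) + np.random.normal(0, 0.2, t.shape)

b, a = signal.butter(3, 0.1)              # order-3 low-pass, normalized cutoff 0.1
filtered = signal.filtfilt(b, a, noisy)   # filtfilt applies it without phase shift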
Cleaning / Handling Outlier
Outliers are tricky; we need to be sensible about whether an outlier point should be kept in (because it is valid data) or kept out (because it is invalid, e.g. due to sensor precision).
Some actions you can do to "outliers"
- Leave it as is, because it's valid
- Remove the record, because it's invalid
- Remove the value and treat it as missing
- Impute (replace it with a calculated plausible value)
Common ways to impute:
- use nearby values
- use the average of the known data
- use linear regression
You may need to impute when you don't want to throw away a whole record just because of one low-importance variable. (A small pandas sketch follows.)
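A minimal sketch of simple imputation with pandas (the column name and the strategies shown are just illustrations):

import numpy as np
import pandas as pd

df = pd.DataFrame({'temperature': [20.1, 20.4, np.nan, 21.0, np.nan, 21.6]})

mean_filled = df['temperature'].fillna(df['temperature'].mean())   # impute with the average
interpolated = df['temperature'].interpolate()                     # impute from nearby values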
Regular Expression
Python has built-in regular expression support in the re module. (A short example follows the list below.)
- r'x' matches an x character
- r'.' matches any single character
- r'\.' matches a single . character
- r'x*' matches 0 or more x
- r'x+' matches 1 or more x
- r'x?' matches x but it's optional (0 or 1)
- r'^' matches start of string
- r'$' matches end of string
- r'[ab.]+' matches 1+ a or b or .
- r'\d+' matches 1+ digits
- r'\s*' matches 0+ whitespace characters ([ \t\n\r\f\v])
- r'\S*' matches 0+ non-whitespace characters
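A minimal sketch of using re (the strings are made up):

import re

text = 'order 66 shipped to unit 7'
print(re.findall(r'\d+', text))           # ['66', '7']
print(re.search(r'^\w+', text).group())   # 'order'
print(re.sub(r'\s+', '_', text))          # 'order_66_shipped_to_unit_7'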
Hypothesis Testing
\(H_0\) is the null hypothesis, \(H_A\) is the alternative hypothesis; together these two cases should cover all scenarios.
We assume \(H_0\) is true, and compute the probability (the \(p\)-value) of seeing data at least this extreme under that assumption. If it is under a predetermined threshold \(\alpha\), then we reject \(H_0\); otherwise we fail to reject \(H_0\).
We need to make sure we are not running tests over and over until we reach \(p<0.05\); that is a dishonest way to run experiments. (Remember there is a 1/20 chance of getting \(p<0.05\) by pure chance even when \(H_0\) is true.)
Student T-Test
\(H_0\): the two groups have the same mean
- The sample is representative of the population
- The observations are independent and identically distributed (i.i.d.)
- The populations are normally distributed
- The populations have the same variance
If the samples don't have these properties, then the T-test won't do you any good. (A scipy sketch follows.)
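A minimal sketch with scipy (the two samples are made up):

import numpy as np
from scipy import stats

group_a = np.random.normal(10.0, 2.0, 40)
group_b = np.random.normal(11.0, 2.0, 40)

statistic, pvalue = stats.ttest_ind(group_a, group_b)
print(pvalue)   # reject H0 if pvalue < alpha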
Normal Test
We can use scipy.stats.normaltest to test for normality; the \(H_0\) is that the sample is normally distributed.
If the data does not pass the normality test, we can transform the data by taking the log, the square, or the square root, and try for normality again.
Levene's Test
tests whether two samples have the same variance. (A sketch of both tests follows.)
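A minimal sketch of checking these assumptions with scipy (made-up data again):

import numpy as np
from scipy import stats

group_a = np.random.normal(10.0, 2.0, 40)
group_b = np.random.lognormal(2.3, 0.4, 40)

print(stats.normaltest(group_a).pvalue)            # H0: sample is normally distributed
print(stats.normaltest(np.log(group_b)).pvalue)    # transform, then test again
print(stats.levene(group_a, group_b).pvalue)       # H0: the samples have equal variance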
Type 1 Error
This is when we incorrectly reject a true null hypothesis.
Bonferroni Correction
This is used to account for applying multiple hypothesis tests together: each extra test increases our chance of a Type 1 error because it distorts our overall alpha.
For example, with three tests at a confidence level of 0.95 each, that's \(0.95^3 \approx 0.86\) probability of no incorrect rejection. The Bonferroni correction fixes this by using \(\alpha / n\) for each of the \(n\) tests.
Tukey Honest Significant Difference (HSD) Test
This is another way to compare multiple groups of data, instead of just running separate pairwise T-tests; Tukey HSD takes into consideration the number of groups being compared.
Analysis of Variance (ANOVA)
tests whether any of the groups have different means.
Assumes the groups have equal variance, are normally distributed, and the observations are i.i.d.
After ANOVA produces a result, if the result is significant, you can perform a post hoc analysis such as Tukey HSD to find out which groups are different. (A sketch follows.)
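A minimal sketch of ANOVA followed by Tukey HSD (the three made-up groups and the melt step are just one way to shape the data):

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

a = np.random.normal(10, 2, 30)
b = np.random.normal(11, 2, 30)
c = np.random.normal(13, 2, 30)

print(stats.f_oneway(a, b, c).pvalue)   # H0: all groups have the same mean

# Tukey HSD wants "long" data: one column of values, one column of group labels
melted = pd.DataFrame({'x': a, 'y': b, 'z': c}).melt(var_name='group', value_name='value')
print(pairwise_tukeyhsd(melted['value'], melted['group'], alpha=0.05))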
Non Parametric Testing
Mann Whitney U Test vs Chi Square vs Regression
These tests are used to compare datasets that are not normally distributed.
- Mann-Whitney U: tests whether values in one group tend to be larger or smaller than in another. The values need to be ordinal and the observations independent. The idea: if we merged the two datasets and sorted them, under \(H_0\) the groups should come out evenly shuffled. Works best with ordinal data.
- Chi-square: works on categorical data. Form a contingency table and see how out of proportion the counts are, i.e. what the chance is that the variation in the table is due to chance. The degrees of freedom are related to how many categories you have.
- Regression: this is an inference test too; the null hypothesis is that the slope of the line is 0, i.e. y does not depend on x.
(A scipy sketch of all three follows.)
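A minimal sketch of the three tests with scipy (all data made up):

import numpy as np
import pandas as pd
from scipy import stats

a = np.random.lognormal(2.0, 0.5, 50)
b = np.random.lognormal(2.2, 0.5, 50)
print(stats.mannwhitneyu(a, b).pvalue)        # H0: neither group tends to be larger

contingency = pd.DataFrame({'clicked': [30, 12], 'ignored': [70, 88]}, index=['ad_a', 'ad_b'])
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(p)                                      # H0: the proportions are the same

x = np.arange(50)
y = 0.5 * x + np.random.normal(0, 5, 50)
print(stats.linregress(x, y).pvalue)          # H0: the slope is 0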
Machine Learning Algorithm
Regression (Predict a number)
Take an input and produce the quantity the model predicts. There is linear regression, and there is polynomial regression; usually a higher-degree polynomial allows a closer fit to the data,
but having too close a fit (overfitting) is bad too, because the model is good at predicting the training set but not much of anything else.
Naive Bayes:
Creates predictions by multiplying the likelihoods of the various features together.
Bayes' Theorem: \(P(A|B) = P(B|A) P(A) / P(B)\)
Note the input features have to be independent for this to work, hence the name "naive".
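A minimal sketch with scikit-learn's GaussianNB (the toy data is made up):

import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])

model = GaussianNB()
model.fit(X, y)
print(model.predict([[1.2, 2.1], [5.5, 8.5]]))   # [0 1]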
Nearest Neighbours
Classify a point based on the k nearest training points to it.
A smaller k results in overfitting the data, and a larger k underfits reality.
Support Vector Machines
Generate the line (hyperplane) that has the largest margin with no points inside it; this line separates the classes.
With SVM, we need to decide between a larger margin that tolerates some bad points inside it, or a smaller margin with few bad points (soft margin or hard margin respectively).
If the data is not linearly separable, we can add polynomial features (through the polynomial kernel) for the SVM.
The kernel in an SVM effectively adds features to the model input; a kernel can be non-linear, and the RBF kernel is another option. (A sketch follows.)
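A minimal sketch with scikit-learn's SVC (the toy data, C, and kernel choice are arbitrary):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = SVC(kernel='rbf', C=1.0)   # smaller C = softer margin
model.fit(X, y)
print(model.predict([[0.5, 0.5], [4.5, 4.5]]))   # [0 1]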
Neural Net
- Take an input (like the pixels of an image).
- Feed them into a layer of weights and biases; at each node, we can think of the weights as the strength of the various inputs, and the bias as how likely this node is to fire.
- Apply an activation function to normalize the result (squeezing the range of numbers down to roughly 0 and 1).
- The "network" refers to how the nodes are arranged in layers; each layer influences the next.
- "Learning" refers to properly tuning the weights and biases.
We add extra layers of computation to make more complex decisions, trained using back-propagation techniques.
The model needs initial weights assigned, then uses training data to improve them until the NN converges to good values. (A sketch follows.)
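A minimal sketch with scikit-learn's MLPClassifier (the layer sizes and data are arbitrary choices, not the course's setup):

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.random.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # a simple made-up target

model = MLPClassifier(hidden_layer_sizes=(8, 6), max_iter=2000)
model.fit(X, y)
print(model.score(X, y))   # accuracy on the training data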
Principal Component Analysis (PCA)
We do PCA to reduce the number of dimensions. It works by finding the vector along which the data has the maximum variance, projecting (collapsing) the data onto it, and repeating with the remaining directions until we have as many components as we want.
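A minimal sketch with scikit-learn's PCA (the data and component count are arbitrary):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.normal(size=(100, 5))

pca = PCA(n_components=2)          # keep the 2 highest-variance directions
X_reduced = pca.fit_transform(X)   # shape (100, 2)
print(pca.explained_variance_ratio_)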
Checking classifier accuracy
Precision: how many of the selected items were correct. Recall: how many of the correct items were found.
Preprocessing
SKLearn MinMaxScaler - scales data to the range 0 to 1; use when the distribution is not Gaussian or the SD is very small: \((x_i-min(x))/(max(x)-min(x))\)
StandardScaler - scales your data so the distribution is centred around 0 with a standard deviation of 1: \((x_i-mean(x))/stdev(x)\)
Feature Scaling
MinMaxScaling is
\((X - X_{min}) / (X_{max} - X_{min})\)
StandardScaling is
\(z = (x - mean) / SD\)
so it has mean of 0, and SD of 1.
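A minimal sketch of both scalers in scikit-learn (the data is made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 500.0]])

X_minmax = MinMaxScaler().fit_transform(X)       # each column squeezed into [0, 1]
X_standard = StandardScaler().fit_transform(X)   # each column: mean 0, SD 1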
Feature Engineering
Machine learning for colour prediction needed feature engineering to get better results (we tested the ML models using RGB colours and LAB colours).
Supervised learning
we know the correct results and train the model to reproduce them
Unsupervised learning
where there is no known right answer; the algorithm tries to find structure in the data. Clustering belongs to this category.
A few clustering algorithms include KMeans, AgglomerativeClustering, and AffinityPropagation.
Anomaly detection is another unsupervised technique, e.g. spotting spam, an attacker on a server, fraudulent credit card charges, etc. (A clustering sketch follows.)
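A minimal sketch of clustering with scikit-learn's KMeans (the two blobs and the number of clusters are arbitrary):

import numpy as np
from sklearn.cluster import KMeans

X = np.vstack([
    np.random.normal([0, 0], 0.5, size=(50, 2)),
    np.random.normal([5, 5], 0.5, size=(50, 2)),
])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)   # cluster id for each point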
Spark
Spark dataframes are implemented in Scala, which compiles to the Java Virtual Machine; the Python code just builds execution plans.
pySpark and pandas have some synergy: we can create a Spark dataframe from a pandas dataframe (or a Python list),
data = spark.createDataFrame(pd_DF)
and convert back to pandas with
pd_data = data.toPandas()
Driver - the program you are writing. Executor - manages the data in the Spark dataframe.
In local Spark, there is one driver and \(n\) executor threads.
On a cluster, the driver runs on the gateway, and YARN starts executors on the cluster nodes; the Hadoop Distributed File System (HDFS) stores the data on the cluster nodes. The spark-submit command interacts with YARN.
Partition
data in an RDD is split into partitions; the number of partitions is configurable, and a single partition never spans multiple machines.
We want partitions to be of similar size, so that executors don't sit idle.
In general we want between 100 and 10000 partitions.
We can explicitly declare the number of partitions like so:
x = spark.range(10000, numPartitions=6)
we can combine partitions together with .coalesce(n)
Because partitions can be stored on different machines, we need to be careful about shuffle operations, which move data between partitions on different machines. Be careful with operations like .repartition() and .groupBy().
Operations that can be pipelined, such as .select(), .filter(), .withColumn(), .drop(), .sample(), are okay.
Execution Plan and Lazy Evaluation
Although groupBy is a shuffling operation, it's less severe because Spark is optimized to aggregate within each partition first; we can see this in the output of the dataframe's .explain() method.
Spark uses lazy evaluation, which means only the code needed for the requested result is run. For example, since .show() only displays the top 20 rows, Spark only computes those 20 rows, not the whole dataframe.
We can cache intermediate results that we know we will use later on; the .cache() method lets Spark know this.
Note we can run SQL queries against Spark dataframes: call .createOrReplaceTempView(tableName),
then query with spark.sql("SELECT foo FROM tableName").
With SQL we can't take advantage of caching.
It's good to cache before multiple uses, like this:
int_range = spark.range(..)
values = int_range.select(..).cache()
result1 = values.groupBy(..).agg(..)
result2 = values.groupBy(..).agg(..)
Spark Joins
join is a shuffling operation, so we need to be careful. If we are joining a really small table to a large table, we should broadcast the small table:
small_tbl = functions.broadcast(small_tbl)
joined_data = big_tbl.join(small_tbl, on='id')
Spark User Defined Functions
This is used when we want to run our own Python functions (or Python-only libraries, such as an RGB-to-LAB conversion) against a column.
from pyspark.sql import functions, types

def complicate_function(a, b):
    return a + 2*b

complicated_udf = functions.udf(complicate_function, returnType=types.IntegerType())
The UDF logic is sent out to the executors: the data is converted from its JVM representation into Python, the function is called in a Python process, and the result is sent back into the JVM.
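A minimal usage sketch (df, with integer columns a and b, is a hypothetical dataframe):

# applies the UDF to each row to produce a new column
result = df.withColumn('c', complicated_udf(df['a'], df['b']))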
Resilient Distributed Dataset RDD
the underlying data structure of Spark; it is one-dimensional and holds a collection of whatever values we put in it.
Dataframes are implemented as RDDs of Row objects. To work with RDDs in Python, these are the key pieces:
sc = spark.sparkContext
rdd = sc.textFile('')
pprint(rdd.take(6))
rdd.take(n) retrieves the first n elements as a Python list - similar to df.show(n)
rdd.map(f) applies a function f to each element - similar to df.select(..)
rdd.filter(f) applies a function f to each element and keeps the rows where it returned True - similar to df.filter()
Numpy / Pandas speed
pandas stores each column in a contiguous block of memory, so accessing columns is fast
df['col'].values
# is faster than
df.iloc[0]  # constructing a row object

# using numpy functions is much faster
# than applying the math module per element
np.sin(df['a'])
# rather than
def do_work(a):
    return math.sin(a)
df['a'].apply(do_work)
The numexpr package has its own expression syntax that is compiled internally and allows even faster running times.
Data is stored column-wise in contiguous blocks, so accessing columns is fast; rows need to be constructed, so row operations are slow.
from fast to slow
- numexpr
- numpy expressions
- vectorize
- series.apply
- dataframe.apply
- python loop
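A minimal sketch with numexpr (the arrays are made up):

import numpy as np
import numexpr as ne

a = np.random.normal(size=1_000_000)
b = np.random.normal(size=1_000_000)

result = ne.evaluate('2*a + b**2')   # same result as the NumPy expression, usually faster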
Exercise 6
stats.chi2_contingency() takes in a pandas pivot table (a contingency table).
stats.mannwhitneyu() takes in two series.
We did the ANOVA analysis using stats.f_oneway() on the lists of data.
To do the pairwise Tukey test, we had to use melt to transform wide data into long data.
Exercise 7
Exercise 10
Part 1
Intro to pySpark; we learned Spark dataframes. The key method is spark.read.json, which reads a JSON file and turns it into a dataframe.
groupBy().mean().sort() shows that commands can be chained.
Exercise 11
Part 1
We used Spark dataframes to compute relative scores of reddit comments. The main point of the exercise was to practise placing caches in the right spot: we learned that caching before multiple operations is best. Predefining the schema is helpful too, as it defines the data type of each column.
Broadcasting a small table when joining it to a big table is better than a plain join.
Part 2
Here we practised using RDDs to clean data: spark.sparkContext.textFile gets the RDD, then we can map and filter to clean the RDD dataset.
We added columns to a Spark dataframe with df.withColumn(newColumnName, column expression).
Exercise 12
We used pySpark to get a word count of the most common words in a few novels. The key functions we used were
functions.explode, which transposes one row into multiple rows, and
functions.split, which splits one column value based on a regex string. (A sketch follows.)
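A minimal word-count sketch along those lines (the 'novel.txt' path, the 'text' column name, and the whitespace regex are assumptions, not the exercise's exact setup):

from pyspark.sql import SparkSession, functions

spark = SparkSession.builder.appName('wordcount sketch').getOrCreate()
df = spark.read.text('novel.txt').withColumnRenamed('value', 'text')   # hypothetical input

words = df.select(
    functions.explode(functions.split(df['text'], r'\s+')).alias('word')
)
counts = words.groupBy('word').count().sort('count', ascending=False)
counts.show(10)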