Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media

GalvanizeDataScience/python-for-data-analysis-1

Python for Data Analysis

Course materials for a multi-day course on data analysis with Python using Pandas, based on "Python for Data Analysis, 3rd Edition" by Wes McKinney, published by O'Reilly Media. The book content, including updates and errata fixes, is available for free on the author's website and for sale on Amazon.

Learning Objectives

The objective of this course is to give students practical, hands-on experience with data analysis using the Python programming language. The course is built around state-of-the-art data analysis tools that are widely used in industry.

This course will cover the majority of Python for Data Analysis by Wes McKinney. On completion of this course students should be able to:

  • Recognize and select data types used in Python for data analysis;
  • Understand how to prepare data for further analysis using the Pandas, Matplotlib, and Seaborn libraries;
  • Understand and apply data modelling and analysis workflows in Python;
  • Apply Python for real-world data analysis problems.

Lessons

Module 0

A whirlwind tutorial of the basics of the Python programming language. The module also covers enough IPython- and Jupyter-related topics to make learners comfortable with the programming environment before tackling the more advanced material presented in later modules. This material should be shared with students for review before the start of the course.

Tutorial | Open in Google Colab | Open in Kaggle
Python Language Basics | Google Colab | Kaggle
Built-in Data Structures, Functions, and Files | Google Colab | Kaggle

Module 1

After completing this module, learners should understand the main data types used for data analysis in Python, such as NumPy arrays, Pandas Series, and Pandas DataFrames. Learners should also be able to read data from, and write data to, storage in various formats using Pandas.

Tutorial | Open in Google Colab | Open in Kaggle
NumPy Basics | Google Colab | Kaggle
Advanced NumPy (optional) | Google Colab | Kaggle
Pandas Basics | Google Colab | Kaggle
Data Loading, Storage, and File Formats | Google Colab | Kaggle
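
As a rough illustration of the data types this module covers, the sketch below builds a NumPy array, a Pandas Series, and a Pandas DataFrame, then writes the DataFrame to CSV and reads it back. It is not taken from the course notebooks: the variable names and values are made up, and an in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np
import pandas as pd

# A NumPy array: a fixed-type, n-dimensional block of values.
arr = np.array([1.5, 2.5, 4.0])

# A Pandas Series: a one-dimensional array with labels (the index).
ser = pd.Series(arr, index=["a", "b", "c"], name="value")

# A Pandas DataFrame: a table whose columns share one index.
df = pd.DataFrame({"value": ser, "doubled": ser * 2})

# Write the table as CSV text, then read it back.
# io.StringIO is an in-memory stand-in for a real file (illustrative only).
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
round_trip = pd.read_csv(buffer, index_col=0)

print(round_trip)
```

The same `to_csv` and `read_csv` calls accept a file path instead of a buffer, which is the more common case in practice.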

Module 2

After completing this module, learners should understand how to prepare (i.e., clean, manipulate, aggregate, and visualize) data for further analysis. Learners will develop a working knowledge of the Pandas API as well as a basic knowledge of plotting and visualizing data with Matplotlib and Seaborn.

Tutorial | Open in Google Colab | Open in Kaggle
Data Cleaning and Preparation | Google Colab | Kaggle
Data Wrangling | Google Colab | Kaggle
Plotting and Visualization | Google Colab | Kaggle
Data Aggregation and Group Operations | Google Colab | Kaggle
Time Series | Google Colab | Kaggle
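
The clean, manipulate, and aggregate steps named above might look like the following minimal sketch. The data and column names are invented for illustration and do not come from the course notebooks; plotting is left out to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: inconsistent text, a missing label, a missing value.
raw = pd.DataFrame(
    {
        "city": ["Oslo", "oslo ", "Lima", "Lima", None],
        "sales": [10.0, np.nan, 7.0, 4.0, 5.0],
    }
)

# Clean: drop rows with no city, normalize the text, treat missing sales as 0.
clean = raw.dropna(subset=["city"]).copy()
clean["city"] = clean["city"].str.strip().str.title()
clean["sales"] = clean["sales"].fillna(0.0)

# Aggregate: total sales per city.
totals = clean.groupby("city")["sales"].sum()

print(totals)
```

Filling missing sales with zero is only one possible choice; dropping those rows or imputing a mean may suit other datasets better.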

Module 3

After completing this module, learners will understand how to develop basic data modelling and analysis pipelines using Patsy, Statsmodels, and Scikit-Learn.

Tutorial | Open in Google Colab | Open in Kaggle
Defining Data Models using Patsy | Google Colab | Kaggle
Statistics Approach to Data Modeling with Statsmodels | Google Colab | Kaggle
Machine Learning Approach to Data Modeling with Scikit-Learn | Google Colab | Kaggle
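
To give a feel for how the two modelling styles in this module differ, the sketch below fits the same straight line twice: once with a Statsmodels formula (formula strings of this kind are what Patsy describes) and once with a Scikit-Learn estimator. The data are synthetic and noise-free, chosen only so that both fits recover the same slope; nothing here is taken from the course notebooks.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# Synthetic data on an exact line: y = 2x + 1 (illustrative only).
df = pd.DataFrame({"x": np.arange(10, dtype=float)})
df["y"] = 2.0 * df["x"] + 1.0

# Statistics approach: describe the model with a formula string.
ols_fit = smf.ols("y ~ x", data=df).fit()
slope_sm = ols_fit.params["x"]
intercept_sm = ols_fit.params["Intercept"]

# Machine learning approach: pass a feature matrix and a target array.
sk_model = LinearRegression().fit(df[["x"]], df["y"])
slope_sk = sk_model.coef_[0]
intercept_sk = sk_model.intercept_

print(slope_sm, intercept_sm)
print(slope_sk, intercept_sk)
```

The formula interface adds the intercept automatically and returns named parameters, while the Scikit-Learn estimator works on plain arrays and fits into larger pipelines.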

Module 4

Finally, learners will have an opportunity to apply the skills they have learned to analyze real data. Typically, instructors should select three of the following projects to cover over one day of instruction.

Tutorial | Open in Google Colab | Open in Kaggle
Data Analysis Example: Bitly Data from USA.gov | Google Colab | Kaggle
Data Analysis Example: MovieLens 1M | Google Colab | Kaggle
Data Analysis Example: US Baby Names | Google Colab | Kaggle
Data Analysis Example: USDA Food Database | Google Colab | Kaggle
Data Analysis Example: 2012 Federal Election Commission | Google Colab | Kaggle

How to teach this course?

To get the most out of this material, learners should have completed Python Crash Course before attempting this course (but this is not a strict prerequisite).

Instructors have a few options for teaching the material.

  1. Have the book open on an iPad (or similar); have the students open a new blank notebook; live code some (or all) of the examples from the book and use the text of the book as speaking notes.
  2. Have the students open the book in their browser; have students open a blank notebook in another browser window and then have them read through the relevant chapters of the book and code up the examples. The lead instructor and any teaching assistants are available to troubleshoot and answer individual questions. Common questions should be answered for the whole group as a live demo.
  3. Have the students open the book in their browser; have students open the provided notebooks in another browser window and then have them read through the relevant chapters of the book and execute the provided code. The lead instructor and any teaching assistants are available to troubleshoot and answer individual questions. Common questions should be answered for the whole group as a live demo.
  4. Some combination of the above.

Option 1 is the most difficult for the lead instructor but likely the most engaging for learners. Option 3 is the easiest for both the lead instructor and the students but likely results in the least learning. Option 2 is a middle ground: easier for the lead instructor than option 1, but students still have to write their own code.

License

Code

The code in this repository, including all code samples in the notebooks listed above, is released under the MIT license. Read more at the Open Source Initiative.