VSCSE Data Intensive Summer School, June 30-July 2, 2014

Where: Lubar S250, University of Wisconsin-Milwaukee
When: June 30-July 2, 10:00 a.m. to 4:00 p.m.

Parking is available in the UWM Student Union lot. Lunch will be on your own; there are several dining options in the nearby Student Union.

If you registered but are unable to attend, please let us know so that someone else can take advantage of this opportunity.

Prepare your laptop computer in advance

To save time, our instructors ask that everyone review the following links and download and/or install the listed content before arriving at class on June 30:

  • RStudio (development environment for the R statistical programming language): Follow “download RStudio Desktop” at http://www.rstudio.com/ide/download
  • WEKA (data mining software): Follow the “Download” link on the left-hand side of the home page: http://www.cs.waikato.ac.nz/ml/weka. Download the Stable book 3rd ed. (NOTE: It isn't apparent from the organizers' information where you can download this book, but we did confirm that you do NOT need a book. If you find it online, it would be useful.)
  • Prior knowledge of R is not required, but we do assume that you have some programming experience and familiarity with basic programming concepts (variables, arrays, loops, branching, etc.). You may find it helpful to acquaint yourself with basic R syntax ahead of time. Reading the first two chapters of the following online introduction is recommended: http://cran.r-project.org/doc/manuals/R-intro.html
  • A basic understanding of relational databases and SQL would be useful. If you are unfamiliar with SQL syntax, please consider the following tutorials: http://sqlzoo.net and http://www.w3schools.com/sql/sql_intro.asp
  • KNIME: On the third day, we will explore KNIME, an easy-to-use, visual programming language that's popular in predictive analytics and text-mining communities. However, prior knowledge of KNIME is not required. http://www.knime.org/

Examples and assignments may involve the modification of short, well-documented blocks of code.


Agenda

Monday, June 30

10:00-10:15: Introduction and overview. Robert Sinkovits, Gordon Applications Director, San Diego Supercomputer Center (SDSC).

10:15-11:15: Globus Online for research data management. Rachana Ananthakrishnan, Computation Institute, University of Chicago.

11:15-12:00: Workflows and data provenance. Ilkay Altintas, director of the Scientific Workflow Automation Technologies Laboratory at the SDSC.

12:00-12:15: Break

12:15-1:15: Workflows and data provenance, continued. 

1:15-1:45: Lunch

1:45-4:00: Workflows and data provenance, continued with optional break ~ 1:00 p.m.


Tuesday, July 1

10:00-11:00: File systems, hardware, and the nuts and bolts of storage. Rick Wagner (SDSC).

11:00-12:00: Working with big data. Amarnath Gupta (SDSC) and Bill West (SDSC).

12:00-12:15: Break

12:15-1:15: Working with big data, continued. Gupta and West.

1:15-1:45: Lunch

1:45-4:00: Working with big data, continued with optional break ~ 1:00. Gupta and West.


Wednesday, July 2

10:00-11:00: Introduction to predictive analytics and data mining. Natasha Balac, director of SDSC's Predictive Analytics Center (PAC).

11:00-11:30: Overview of data mining tools. Nicole Wolter (SDSC-PAC). 

11:30-11:45: Break

11:45-1:15: Unsupervised learning (PCA and clustering). Paul Rodriguez (SDSC-PAC).

1:15-1:45: Lunch

1:45-2:45: Supervised learning (decision trees). Wolter and Balac.

2:45-3:00: Break

3:00-4:00: Techniques and strategies for big data. Rodriguez.