Saturday, September 12, 2015

Math of Data Science @ ICERM


Note: This is a re-post of the blog post I wrote originally on WordPress last month. It appears that Blogger is a much better platform for the kind of posts I'd like to make. Yes, I'm still learning ...

I recently attended the topical workshop on Mathematics in Data Sciences at ICERM. The attendance was good mix of students, postdocs, and researchers/faculty from academic institutions and national labs along with a sizable number of industry folks. The line of talks involved a similar mixture as well (abstracts/slides from many of the talks are available from the workshop page linked above). In particular from the industry side, there were talks from data scientists (or "engineers" with similar roles) from Ayasdi, LinkedIn (formerly), Netflix, New York Times, and Schlumberger-Doll, to name a few. Indeed, I found this diversity a direct indicator of the young age of the discipline in question, i.e., data science. And yes, the usual jokes about data science/big data were not spared, including the one about how big data is like teenage sex, how big data is very much a man's game since you usually hear men boasting "my data is bigger than yours", and how data scientists are mostly data janitors!

Coming from the academic side, what I found most interesting at this workshop were the panel (and open) discussion sessions. In the first such discussion, the group tried to come up with a (not so short!) list of topics that a program in data science should train the (undergraduate/masters) student in. After starting with the usual suspects such as calculus and linear algebra, probability and statistics, algorithms, machine learning, and databases, the group expanded the scope. Next came high dimensional geometry, information theory, data visualization/exploration, experimental design, and communication/business "skills". But many in the audience appeared to be surprised by the suggestion of a class on inverse problems, and electromagnetism (yes!). The topics and then associated skills to be taught soon filled up two large panels of white board (recall the "teenage sex" joke, any one?). To wrap up the session, it was suggested that the student be trained (at least) in Python, GitHub, Sql (or something similar), all from the point of view of industry readiness. As far as the mathematicians are concerned, it was suggested that they could start by making a wish list of all results (related to data science) one wants to see as theorems, and such a list will keep them busy for more than a life time. But to see any such effort make huge impact, one should ideally work with a domain expert. One particular subtopic of much importance in this context (no pun intended!) is that of textual data - very important for data science, and as yet not well explored by mathematicians.

The panel discussion about careers in data science was quite popular as well. A majority of the panelists were junior (read "young") data scientists from the industry, and were able to shed a lot of light on what a typical work day looks like for a data scientist. One aspect of their work that particularly appealed to me (coming from traditional academia) is how quickly and directly they are able to see the impact of their ideas and work. For instance, a data scientist in a social media company could brainstorm for 2 hours, write the code in 2 more hours, and see thousands of users enjoying the benefits before the end of the day! On the other hand, academicians often wait years, if not months, to just count citations of their papers.

If there was one take home message from the data scientists to (young and old) aspirants, it was to just play with data - of many types and from many sources, and not to worry so much about all the different classes/training (or proving theorems). Be ever-ready to dive into any data that you come across, manipulate/analyze it quickly, and get the first insights.

I'm not attempting to list any summary/thoughts on the Mathematics involved (as meant in the title of the workshop). The list of relevant Math/Stat/CS topics has been huge already, and is not getting any shorter in this era of data science. I doubt if we're going to precisely define what data science is any time soon!