Note: if you’re interested in learning more and building a simple WaveNet-style CNN time series model yourself using Keras, check out the accompanying notebook that I’ve posted on GitHub. For an introductory look at high-dimensional time series forecasting with neural networks, you can read my previous blog post.
Note: if you’re interested in building seq2seq time series models yourself using Keras, check out the introductory notebook that I’ve posted on GitHub.
For the last few months I’ve been working on Porto Seguro’s Safe Driver Prediction Competition, and I’m thrilled to say that I finished in 18th place, snagging my first Kaggle gold medal. This was the largest Kaggle competition to date, with ~5,200 teams competing, slightly more than the Santander Customer Satisfaction Competition. We were tasked with predicting the probability that a driver would file an insurance claim within a year of taking out a policy – a difficult binary classification problem. In this post I’ll describe the problem in more detail, walk through my overarching approach, and highlight some of the most interesting methods I used.
We just wrapped up our second bootcamp project, presenting on Friday after two weeks of crunch time. For this project we had to generate our own datasets with web scraping and use them to build predictive regression models. The course section kicked off by covering Python scraping tools (BeautifulSoup, Selenium, and Scrapy), then shifted toward core regression techniques as we gathered workable data. We had opportunities to practice feature selection, model assumption testing, regularization, and cross-validation. Best of all, we got to choose our own websites to target and define the scope of our project. I decided to see if I could predict how many views a question on the statistics Stack Exchange would get. The Stack Exchange site (“Cross Validated”) is where people go to ask whatever statistics questions they might have and get insight from the broad internet community – see below for an example of an on-point question from RustyStatistician!
My first week at Metis Bootcamp has flown by, and so far the biggest challenge has been getting this blog set up (just kidding, but it definitely took me much longer than it should have). This week’s focus was getting up to speed on several of Python’s linchpin data manipulation and analysis packages like numpy, pandas, and matplotlib, with our workflow anchored by a project on MTA subway traffic data. The goal of the project was to help a hypothetical non-profit promoting women in tech come up with a strategy for canvassing at NYC subway stations. We were given broad leeway to set assumptions about what to look for in a station and who we wanted to target, and could incorporate any data we thought would be useful in addition to the MTA’s official turnstile traffic data. Like many other teams, mine decided to cross-reference the traffic data with neighborhood wealth data to aim for stations with high potential for both raising general awareness and finding valuable donors.