He Ate Hundreds of Burritos In the Name of (Data) Science! (Scott Cole) - KNN Ep.81

Ken Jee
Jan 11, 2022
36 min read

Updated: Feb 27, 2022

Today I had the pleasure of interviewing Scott Cole. Scott is technically a machine learning engineer in the fintech space, but stubbornly prefers a “data scientist” title. He studied bioengineering at Clemson University before moving to UC San Diego for grad school where he developed algorithms and python packages to analyze brain wave data. Scott then went through the Insight Data Science program, and for the last 2 years has been building algorithms for detecting fake accounts on a peer-to-peer payment app. Scott’s most known for eating hundreds of burritos during grad school and curating a small database of their qualities along 10 dimensions. In this episode we learn the origin of Scott's burrito dataset, how he narrowed his career search based on his interest, and how he is trying to create a data centric board game. I hope you enjoy the episode, I know I had a blast speaking with Scott.

Transcription:

[00:00:00] Scott: There's a lot of, like, if I was a data analyst for five years, I feel like there's just as many, if not more like really good skills that I would learn in that position compared to like a PhD. But I just have a certificate at the end of five years.

[00:00:26] Ken: This episode of Ken's Nearest Neighbors is powered by Z by HP. HP's high compute, workstation-grade line of products and solutions. Today I had the pleasure of interviewing Scott Cole. Scott is technically a machine learning engineer in the FinTech space, but he stubbornly prefers data scientist as a title. He studied bioengineering at Clemson University before moving to UC San Diego for grad school, where he developed algorithms and Python packages to analyze brainwave data.

Scott then went through the Insight Data Science program, and for the last two years, he's been building algorithms for detecting fake accounts on a peer-to-peer payment app. Scott's most known for eating hundreds of burritos during grad school and curating a small database of all their qualities along 10 dimensions.

In this episode, we learned the origin of Scott's burrito dataset, how he narrowed his career search based on his interests and how he's trying to create a data centric board game. I hope you enjoy our episode. I know I had a blast speaking with Scott. Scott, thank you so much for coming on the Ken's Nearest Neighbors Podcast this week.

I'm really excited for this one. I obviously was introduced to you. I also heard about and have analyzed your burrito dataset, which I think is very fascinating. And I just knew that if someone put together that dataset, they would probably be interesting to talk to on the show.

So again, thank you for coming on, and how are you doing today?

[00:01:51] Scott: Yeah. Yeah. Sweet. Thanks. Glad to be here. It's kind of funny how much I keep hearing about the burrito project years and years later. Like, I stopped collecting burrito data two or three years ago, but the project lives on, and it was a fun dataset.

I've learned the burritos in San Francisco aren't as good as the ones in San Diego.

[00:02:16] Ken: Well, you know, I think that that's sort of the beauty of a data science project, right? Is it lives on beyond the time you've spent working on it, right. And also, essentially like advertises for you while you sleep or while you're doing something else.

Right. I mean, I think a lot of people are familiar with that dataset and probably familiar with you because of it. And like you did that work once again, like two years ago, and now you're still, you know, getting awareness, you're getting podcasts, interviews, you're getting whatever it might be that came out of that, which I think is like a really positive thing for just sharing your work and putting stuff that you're passionate about into the world.

[00:02:56] Scott: And I'm definitely hoping that some people are just finding good burritos in San Diego. Like, I am a little worried someone might say, Oh, I tried the number one burrito in your dataset and I thought it was junk. But yeah, hopefully it's guiding people in the right direction too.

[00:03:14] Ken: Well, I am planning a trip to San Diego just to tour burritos based on the dataset.

So I'll, whenever I get out to do that, I will let you know.

[00:03:22] Scott: Yeah. I feel like I have to offer a caveat: the number one rated burrito in the dataset was Valentine's, whereas my personal favorite is called the Taco Stand. So I'd recommend you go to the Taco Stand. And I feel like there's an important caveat of the data, where the people I was collecting data from at Valentine's tended to be tourists who were coming to San Diego.

So my friends who were visiting were in downtown, where Valentine's is. Often it was like their first San Diego burrito, so they had maybe low expectations, or low standards I should say. So I feel like that might have artificially inflated Valentine's ratings. Anyway, that's just one caveat of the data.

[00:04:18] Ken: Well, I base all my burrito decisions on the purely quantitative metrics. So I was only looking at the mass and the length of the burritos.

[00:04:31] Scott: It was fun. Just whipping out a tape measure and seeing people give me weird looks.

[00:04:37] Ken: I love that. Yeah, I wish you'd brought like a scale to everyone that would have been some next level stuff.

I mean, they make pocket scales now. So next time, when you start measuring all the burgers or something in San Francisco, or all the poke bowls, you can start using the scale.

[00:04:57] Scott: Yeah. One of my friends, actually, a couple of years into the project, gave me a scale for my birthday. And so for a while I tried to figure out how to balance a burrito on a scale, but it's pretty difficult.

[00:05:10] Ken: That is hilarious. So in order to, I mean, obviously I think that was a pretty, pretty good way to get acquainted with you, but for the listeners who are familiar with like the normal structure of the interview, how did you first get interested in technology and data? Was there just a pivotal moment where you're like, Hey, this is what I want to do and what I'm interested in, or was it more of a slow progression?

[00:05:35] Scott: I felt like it was a bit of stumbling around. In undergrad, I started off sort of on like a wet lab track, where I wasn't really sure what I wanted to do. I was like, Oh, I like science and math. So I was, you know, basically encouraged to like, Oh, start doing research. And I didn't really know what research was at the time, but I was playing around with some chemicals.

I ended up not being a huge fan of it. Like, I don't know, my experiments would often get contaminated with bacteria or something, and I didn't feel like I had a lot of control over it. So what I found that I liked more was a signal processing class that I took, where it was pretty much saying like, Oh, we have this complicated signal.

Can we deconstruct what that means? And I had also been looking into brain data at the time. And when I saw the signals in the class, it reminded me of like, Oh, there are these brainwaves that people see when they record from the brain, and I think it'd be so cool if we were able to understand what that meant.

So I feel like brainwave data is what initially got me interested in data science in general. Just thinking, Oh, if we could understand what these noisy signals are, then we could maybe enhance our brains or treat disease or something really cool.

[00:07:15] Ken: That's awesome. I've been getting a lot more interested in the different brainwave data types.

Well, I guess it's all the same. The different frequencies is probably a more apt way to put it. We're reading this book called Altered Traits, and it's about the science of meditation and what's true and what's not true, and evaluating the brains of hardcore meditators or people that practice with different protocols.

And it's pretty fascinating. A lot of it is just debunking, like, Hey, meditation is supposed to do this, but it actually doesn't. But the real science that they've uncovered is also pretty fascinating, like these Tibetan monks that they bring in, and how differently their brains work compared to someone who's on their phone or computer all the time is pretty wild.

But so obviously you took that a step further and you studied within that domain for quite a period of time. Do you mind kind of walking me through, you know, from that moment of realizing this is interesting, where did that take you, or how did that provide some direction?

[00:08:20] Scott: Yeah. Yeah, I wasn't too sure what to do after undergrad, other than go to grad school. Getting a real job seemed pretty scary. So I feel like I was on the track of like, Okay, I'm in research. I can keep doing more school. So I applied to bioengineering and neuroscience grad programs, kind of with this dream of like, Oh, I can spend my days analyzing brain data and understanding what it means.

So yeah, I ended up in San Diego, and in the first year I was starting to do experiments and rotate between a few labs to figure out what I wanted to do. And the experimentation part was really challenging for me. Like, we had a bootcamp at the beginning where we had to record from leeches, and I was just so helpless. And trying to collect my own data on my first rotation, I attempted to do some surgeries to implant electrodes into an animal's brain.

And I had to get a lot of help from the senior grad students. So I quickly realized how helpless I was in collecting my own data. So I rotated into other labs that had the, like, you know, cheat code of grad school, where I don't have to collect my own data. We rely on collaborators who send us data, and we just get to analyze it.

And that sounded a lot more fun. So I ended up rotating in a MATLAB lab and a Python lab, and ended up choosing the Python lab. Partly not so much for the science itself, but just thinking like, Oh, if I want to become a data scientist after graduating, after I realized how helpless I was in collecting my own data, it sounds like it's useful to know Python.

[00:10:28] Ken: Well, I mean, it obviously sounds like you were interested in or aware of data science relatively early on. And it sounds like you weren't necessarily interested in pursuing academia that far. Why did you decide to stick it out through all the research and finish the PhD rather than just jumping out early?

[00:10:51] Scott: Yeah. It's partly just being intimidated by what is outside of grad school. Trying to find a real job is just scary, partially. And partially also just being, I don't know, still genuinely interested in seeing if I can discover something new about the brain. But it was actually funny: in one of my interviews before I even started grad school, a senior grad student was giving a talk about how he realized that what he was passionate about was not actually the research.

It was actually data science. And I watched his talk and I was like, Oh crap, man, maybe that's me. Maybe I shouldn't be doing this. But I suppressed that thought a little until I got to bootcamp. And then, as I mentioned, the failed experimentation reminded me like, Oh, this is difficult. I may want another route.

And I feel like I got pretty lucky in the lab I ended up joining, Brad Voytek's. He was actually a data scientist at Uber before he became a professor. So our lab was very data science oriented. And we had this eccentric postdoc who would yell at me for how bad my code was, in a way to help build up our skill and really give us more options after graduating.

And I'm very thankful for that.

[00:12:24] Ken: Well, you know, I really like that, you know, you kind of stuck through it because you enjoy the pursuit of like information and knowledge. That's something that I really I wish I had appreciated more when I was going through school or when I was doing some of these things.

I mean, it took me to like the second grad program I did to realize that, Hey, like I'm doing this because I like am fascinated with learning, not just a means to an end. And I think if a lot of people took that approach with their education more, they'd get a lot more out of it. Right. It's like, Oh, I'd have to get to school, get through school to get a job.

Actually. There's not, it's like, no, there's a lot of things that you can do during school that will be very meaningful in a job market or whatever it might be. But you can also find like utility in them. You can also put your own fun, spin on them. You can also like, enjoy that process. And I just get really disappointed when I look back and I see that my own like transactional nature with education, but then I see other people like trying to do a master's in data science or something like that.

It's like, are they really going to learn data science? Are they going to land a job? Right. And it's like, you can do both and you can have fun with both. But unfortunately I think a lot of people are going the other route, where it's just like, Hey, I know this is something that will be helpful in a job search.

I'm just going to do it to check that off my list. Like, that's a lot of time, you know.

[00:13:52] Scott: I feel like I wouldn't have made it otherwise. I felt fairly lucky in that, like, Oh, I got to spend my days analyzing data the way I want to analyze data. So that freedom definitely helped me keep going.

[00:14:12] Ken: I love that. And so how did you make this eventual transition from academia to data science? I think that that's something a lot of people probably have questions about.

[00:14:26] Scott: Yeah, my mind shift was pretty quick early on. I wanted to do an internship to get a sense of like, Oh, what is work like outside of academia?

And I did a short one, about three months, at a place called Crime Lab in New York. It wasn't like a company or anything. It was a small organization, actually part of the University of Chicago, but it partnered with local city governments in order to analyze data for them. And the project that I got to work on was analyzing police officer behavior and trying to predict, like, Okay, which officers are most likely to be involved in an excessive force incident.

And then those are the officers that should get the de-escalation training. So the department wanted to allocate those resources really effectively. So it was a fun, cool data science opportunity, and doing that made me pretty sure, like, Okay, there are a lot of cool questions out there beyond neuroscience.

[00:15:49] Ken: It's a pretty relevant one, pretty relevant these days. Did you do that while you were doing your PhD, or did you take a break to do that, or what did that look like?

[00:15:59] Scott: Yeah, I took a break. I was sick of looking at the same rat data for almost a year. So I took a few months off and then went back to finish my PhD in another year.

And then moved up to San Francisco to find a job.

[00:16:19] Ken: Excellent. And so after that, how did you actually land that first job? You know, I think we talked about how a bootcamp was involved. What did that process look like?

[00:16:35] Scott: Right, right.

So yeah, I had heard this bootcamp advertised called Insight, which takes a bunch of grad students disgruntled with academia and puts them all in a room together, and all of them want a data science job. So I did that program, which was eight weeks. They basically tell everyone, You need to work on a project.

And we worked together, which is a really cool atmosphere to be in, just a bunch of people in the same situation. Not really sure what data science is yet, but really, really hoping to get a job. Insight brought in data scientists from a few dozen companies to tell us what their work was like, to give us more of a sense of, Oh, this is what working at a startup is like, or a FinTech company or a consultancy or something like that, which really opened my eyes to the possibilities. Before that, I kind of only knew of like, Oh, I could be a data scientist at some big name company. I didn't really know much about, you know, other options.

So in that bootcamp, a lot of us were just really eager to get any sort of job. Interviewing for months is not fun, and we all commiserated with each other. So after that, my first data science job was at a TV analytics company that I found through Insight. That was cool.

I really enjoyed the team I worked with, but after a few months I sort of figured out that I wasn't really passionate about what I was doing, like organizing TV data in order to make ad targeting better. And I started thinking like, Oh, wow, what else could be out there? That was actually triggered by my manager at the time.

She decided to leave. So when she left, I was like, Oh, maybe I'm not as into this either. And I was really lucky that one of the guys I had met when I moved to San Francisco worked in fraud detection for a FinTech company and was like, Hey, we have open positions.

You should come help us fight fraud. And I thought, that sounds fun. So yeah, after a few months I jumped to there, and I've been doing fraud detection ever since.

[00:19:48] Ken: That's awesome. Well, you told me offline that you probably never saw yourself working in FinTech, but you've kind of fallen in love with fraud detection and that type of problem.

And how do you find problems that you're interested in solving? What is that process like? Or, you know, how do you go about finding things that are interesting to you? You recognized that the work at the TV analytics company didn't really scratch your itch, but how do you go about, again, finding those areas that you do find interesting?

[00:20:17] Scott: Yeah. I'd say it's not a conscious thing I do, but more just thinking throughout my life, what are the things that I enjoy, and identifying like, Oh, there's a job opportunity there. So I've always felt like I enjoy seeing systems gamed or hacked slightly. Like, I've created fake profiles on websites and just found that process fun for me.

So the concept of someone making fake bank accounts and stealing money kind of seemed like a game in some sense, of like, Oh, can I outsmart this person who's trying to trick the system? I had heard a talk in grad school from someone who did fraud detection, and I was just like, Oh, that sounds fascinating.

Another thing that I am interested in is online dating data science, just because I feel like I've personally heavily relied on online dating to meet people, because meeting people in real life can be difficult. And I really like that, actually. I really like the idea of having, like, a database with these properties of people who may end up being really good matches with one another.

I think that that is likely to build potentially a lot better relationships than just whoever I can randomly meet throughout life. So that is one thing I'm thinking of, like, Oh, if I could contribute to the algorithms that help people meet each other, that's another area that I'm really interested in. But yeah, I don't think that there's any systematic way to identify them, just sort of keeping my eye out for what might mesh with my interests.

[00:22:24] Ken: I really like that. I think, again, we talked offline, and one thing you mentioned to me is that you started looking for positions after the TV company based on what your interests were, not the other way around. You weren't just looking for another job. You were looking for a position that would be in one of those spaces, and it just so happened that

one of them was at a FinTech company, right, and this opportunity came about. Can you talk about maybe the importance of that for you, and how that might've been a little bit different than your first experience, when you were just trying to land a job?

[00:23:00] Scott: Yeah. I think part of my motivation is laziness in some way, where when I first wanted a job, I just wanted to stop studying, and I was just like, I want this interview process to be done and to find something that's reasonable. One of my top priorities, along with something I'm interested in, is just having a team that I'm comfortable with, and working with nice people. Like, that will at least make my life, you know, fairly happy.

Whereas once I had a job, my laziness becomes like, Okay, I don't want to interview so much. I don't want to look at so many different companies. I want to really limit my scope to like two positions that I'm especially interested in. So yeah, that's why I was figuring, okay, I only want to look at fraud detection companies right now, and if they both reject me, then finally, once I work up the energy to apply to more companies, I will look into something more.

[00:24:03] Ken: Well, you know, it's interesting. It's a really important point. I think you brought up in terms of like efficiency and like sort of laziness is that a lot of people think it's most efficient to just apply to a bunch of companies and see what hits, right.

It's like, Oh, you know, I can send out this generalized resume, whatever it may be. I actually think it's the opposite. I think if you're applying very specifically in a niche, one, the projects you do, or the way you prepare, can be a lot more specialized, which gives you a significantly higher chance of getting an interview or landing the job.

But two, you know, you're able to hone in and essentially work less. Rather than trying to please everyone, you're just trying to please a couple in this specific space, and you can get so much more specific with your language and your resume and these types of things that I think it's actually, as you mentioned, way more efficient to do that.

Like, figure out exactly what you want and cater everything to give you the best chance at those fewer positions, rather than spreading yourself so thin, because then, just as you described, you have to prepare for all these different types of interviews. You have to do XYZ. I would imagine that fraud detection interviews, right,

they're probably more similar to other fraud detection interviews than to like a broad generalist data science interview. Right. And so, if you know that, you can prepare for those things, right?

[00:25:25] Scott: I think in the future it would be. Actually, in the interviews I ended up doing for fraud detection,

it was just a general data science pipeline. But I think as someone gets more senior and specialized in fraud detection, it would probably be that way. I think one thing that's really important to couple with applying to my interests is also applying to what is available and heavily recruiting.

Like I had mentioned, probably what my first choice was, actually, was working on algorithms for online dating. Like, I wrote a passionate cover letter and applied many times to Coffee Meets Bagel, a dating app, to work for them, but they were just never looking to hire a data scientist like me, or for something that I wanted to do. They were more looking for data engineers.

So no matter how much I bothered their employees with how interested I am in working on that, I don't think they were ever going to hire me. So yeah, in addition to, you know, finding what we're interested in, also keeping our ear out for, okay, what companies are specifically hiring.

[00:26:47] Ken: Yeah.

Well, I mean, there's obviously a lot of constraints like geography, right? I mean, you could work at a, I can't remember where some of the other ones are. I think one of the other dating apps is based in New York. Right. And it's like, well, if you didn't want to go work in New York, then you'd probably write that one off and not try to apply there.

And it's, there's deceptively easy ways to narrow the search that make it like more manageable for you to do and give you, in my mind, significantly higher odds. I would like to ask and get into what you find. So I obviously you described why you're so interested in fraud detection, but getting into more of the weeds about fraud detection, I think could be pretty interesting.

You described it as it's like a very interesting space with constraints that I hadn't considered. Do you mind just going into that just a little bit?

[00:27:38] Scott: Yeah. So when I first realized I was interested in fraud detection, like you were saying, I didn't really know exactly what I was getting into. When I was Googling online for like fraud detection data science, most of what I was able to find was saying like, Oh, you need to do anomaly detection and you need to know how to work with imbalanced data.

And I really struggled to find more descriptions other than that. So when I was interviewing and I was saying, Oh, I'm really interested in fraud detection, there's part of me that's like, do I know what I'm getting myself into? And yeah, over the past couple of years I've been working on it.

There's definitely been a lot of cool realizations and considerations that I wasn't aware of before. I feel like the principal one is the evolutionary nature of fraud, where once we put out a model to stop fraud, the fraudsters are going to find a way around that pretty quickly.

It's a huge challenge to mitigate. I mean, no matter how fancy our model is, if the training data is not representative of the evaluation data, then it's not going to perform very well. And so there's a lot of considerations that we need to keep in mind, knowing like, Oh, the fraudsters are going to be reacting to these models.

[00:29:17] Ken: So how do you design a model that responds quickly? So like, I'm doing a video this week about what happened with Zillow, right? Where essentially there's a lot of problems with what they did, but you know, the pandemic came and their models were not apt to handle what's happening in a global pandemic.

Like, the changes in the real estate market happened more quickly than in previous times, and I think it's their fault for not pulling the plug, but at the same time, their model was not designed to rapidly adjust and iterate quickly like that. And digital transactions are different in, like, scope, right?

Like, they do happen quickly. You do have more data. But what can people do to make sure their models are up to date in a reasonable time, aside from just constantly retraining? Like, how do you balance that?

[00:30:10] Scott: Yeah, I've watched a few talks online to get a sense of what different companies do, and like you're saying,

model retraining seems to be the most common solution, where maybe every day, every week, or every month they're retraining the model, so we have the most up-to-date information. And another approach that I've heard mentioned is looking out for anomalies. So is there some new suspicious behavior that a company's models are not picking up on?

Like, if they have a complementary anomaly detection framework, then that can help them realize quickly, Oh, this is how the fraudsters are getting around their models.
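A minimal sketch of the retraining idea Scott describes, not anything from his actual work. Everything here is hypothetical: the single "amount" feature, the simulated numbers, and the toy midpoint-threshold "model" (`fit_threshold`). It assumes fraud labels for a period become available before the next period is scored, and it only illustrates why a model trained once can lose recall as fraudsters adapt, while one refit on the most recent labeled period keeps up better.

```python
import random
from statistics import mean

def make_batch(rng, fraud_mean, n_legit=500, n_fraud=50):
    """Simulate one period of (amount, is_fraud) pairs; all numbers are made up."""
    legit = [(rng.gauss(50, 10), 0) for _ in range(n_legit)]
    fraud = [(rng.gauss(fraud_mean, 10), 1) for _ in range(n_fraud)]
    return legit + fraud

def fit_threshold(batch):
    """Toy 'model': flag amounts above the midpoint of the two class means."""
    legit_mean = mean(a for a, y in batch if y == 0)
    fraud_mean = mean(a for a, y in batch if y == 1)
    return (legit_mean + fraud_mean) / 2

def recall(threshold, batch):
    """Fraction of fraudulent transactions that the threshold catches."""
    fraud = [a for a, y in batch if y == 1]
    return sum(a > threshold for a in fraud) / len(fraud)

rng = random.Random(0)
# Fraudsters start with large transfers, then adapt by shrinking them each period.
periods = [make_batch(rng, fraud_mean=m) for m in (200, 160, 120)]

static_threshold = fit_threshold(periods[0])  # trained once, never updated
static_recalls, retrained_recalls = [], []
for previous, current in zip(periods, periods[1:]):
    retrained_threshold = fit_threshold(previous)  # refit on the latest labeled period
    static_recalls.append(recall(static_threshold, current))
    retrained_recalls.append(recall(retrained_threshold, current))

print("static   :", [round(r, 2) for r in static_recalls])
print("retrained:", [round(r, 2) for r in retrained_recalls])
```

In this toy setup the two models agree in the first scored period, since both were fit on the same data, and diverge once the fraud pattern has moved away from what the static model saw. The complementary anomaly detection Scott mentions is not modeled here.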

[00:31:09] Ken: That's awesome. And is fraud detection like an almost entirely automated task, or is it like a hybrid task where there are humans evaluating things as well?

[00:31:21] Scott: I'd say the humans are very important and that's how a lot of fraud detection has been traditionally done before machine learning or even data analysis was a lot of manual reviews because it can be like really tricky and ambiguous, like what is fraud and what is not fraud? Like how, how do we craft our labels?

And yeah, how to get those labels is a very, very non-trivial problem. So for example, in financial transaction data, there's the concept of a chargeback, where someone reports a transaction as fraud. But not all of those are real fraud. Like, people are dishonest sometimes. Maybe I regret spending $500.

So I'll call my bank and say, Ah, no, that was fraud. So filtering out things like that is something that a human can be a lot better at than an algorithm, or else in the future the industry as a whole will work on making the algorithms better for that.

[00:32:38] Ken: This episode of Ken's Nearest Neighbors is brought to you by Z by HP. HP's high compute, workstation-grade line of products and solutions. Z is specifically made for high performance data science solutions. And I personally use the ZBook Studio and the Z4 Workstation. I really love that the Z line comes standard with Linux, and they can also be configured with the data science software stack. With the software stack, you can get right into the work of doing data science on day 1 without the overhead of having to completely reconfigure your new machine. Now back to our show.

So, you know, you talked a little bit about people lying about their transaction history because they're embarrassed. And you know, there's some of the malicious behavior that goes on with fraud.

You told me that you were pretty passionate also about people lying with statistics. Right. And I'd love to hear more about how people do that when that happens. And what are the repercussions associated with that?

[00:33:37] Scott: Yeah, one of my favorite books was I think it was written in 1952 with the title, like how to lie with statistics.

That was the fun, like mostly picture book where the sky is basically ripping into the papers written in the 1940s and 1950s who published charts that are very misleading. And. In order to make, maybe sell something or push an agenda. And he goes into like the different ways to spot this or call out.

So a few examples being like misleading axes on a graph. If you don't start at zero, then you can make a small change, look really big. And like, yeah. Catch people off guard too. Who may not know no to like, Oh, watch out for that. And also just like not making limited sample sizes, very transparent.

You can say like, Oh, there was a 33% increase. And then the efficacy of this drug, but could you run it on like three subjects? So it's a really cool book to remind our attention, like push our attention to those caveats that that I think are really useful to have in mind when evaluating data critically, actually I think before reading that book, I was like first started thinking about this in terms of scientific data.

So in grad school like I, when I was first starting out as a grad student without much context, like and even now when like broaching into field, like I kind of have to take on faith, what I'm reading, because I don't really have the expertise in order to know like, Oh was this data analyzed appropriately?

Or was it could it have been like miss analyzed in some way? And I found myself in a few instances in grad school, I am talking to a few lab mates where we're talking about a paper maybe with the author of that paper. And saying like, Oh like going through the process of their analysis at some point, finding out that there is like some error in their analysis, the results are not valid.

And yeah, those are really touchy. Understandably, we all make mistakes, and I don't judge those people too harshly. But it's really weird to see, like, "Oh, this is a peer-reviewed, published paper." But it was only once we got to a very technical depth, as some of the only few people who can also critique this method, that we realized, like, "Oh, this conclusion was not accurate because the analysis underlying it was not accurate."

So it was, like, weird in that once we know that, no one else knows that. That paper doesn't get retracted because it's really embarrassing to retract a paper. And so those papers just keep going.

[00:37:39] Ken: But yeah, I mean, something that I've become recently familiar with, particularly in academia is the p-value hacking.

So essentially, yeah, we evaluate a hundred different variables associated with an outcome we want, and probabilistically, like, five of those will show up as significant, right? And it's like, hey, we're manipulating statistics to find something significant, even though, by definition, at that significance level we would expect, like, five at random, right?

And to me, that's a little bit of a problem with the way we evaluate our models in general. I mean, I think most data scientists in industry pay significantly less attention to p-values than people in academia. Because, for better or for worse, in industry you gotta test things significantly more.

Right. You get to rely primarily on the results and outcomes, rather than whether specific variables are relevant. But I mean, that is absolutely being done. I've seen it, I've done experiments, and I'm like, "Oh, this gives me a little bit more pause." But you know, there's, like, layers of sophistication in how people might lie with statistics, and it gets scary.

And, like, you know, a non-data scientist, probably even a researcher off the street, wouldn't think, "Okay, that is a way to manipulate and find something significant in this analysis when there might be absolutely nothing there."
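Ken's "five out of a hundred" figure can be checked with a small simulation. This is a minimal sketch under stated assumptions: every variable is pure noise, each comparison is a two-sided z-test with a known standard deviation of 1, and the function names are illustrative rather than anything from the episode.

```python
import random
from statistics import NormalDist, mean


def null_p_value(rng, n_per_group=30):
    """Two-sided z-test p-value for two groups drawn from the same
    standard normal distribution, so there is no real effect to find."""
    a = [rng.gauss(0, 1) for _ in range(n_per_group)]
    b = [rng.gauss(0, 1) for _ in range(n_per_group)]
    z = (mean(a) - mean(b)) / (2 / n_per_group) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_positives(n_variables=100, alpha=0.05, seed=0):
    """Test n_variables noise variables and count how many look significant."""
    rng = random.Random(seed)
    return sum(null_p_value(rng) < alpha for _ in range(n_variables))


def chance_of_any_false_positive(n_variables=100, alpha=0.05):
    """Probability that at least one of n independent null tests passes."""
    return 1 - (1 - alpha) ** n_variables


print(count_false_positives())           # varies with the seed, about 5 on average
print(chance_of_any_false_positive())    # about 0.994
print(count_false_positives(alpha=0.05 / 100))  # Bonferroni-corrected threshold
```

With 100 independent tests at the 0.05 level, the chance of at least one spurious "finding" is above 99%. Dividing the threshold by the number of tests (the Bonferroni correction) is one simple, conservative way to account for that.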

[00:39:19] Scott: Yeah. I feel like there are a lot of intersecting issues here. Like, one being the incentives for us as researchers.

Like, we really want our results to be significant so we can publish papers and just make our lives easier. So you, like, intentionally or often unintentionally do things like p-hacking.

[00:39:41] Ken: Yeah. It's you want to find something if you're passionate about this research, right? Like you, most people aren't researching something that they don't see promise in.

Right. You know, let's say I'm, like, crazy about some diet model, and I do research in it. There's that built-in confirmation bias too, right? I mean, you'd hope that there wouldn't be, and there's some scientists that are like, "Screw this. I'm not interested in this, but I want to prove people wrong," or "I want to see what an objective truth is."

It's hard to spend so much time studying something and not being interested in it.

[00:40:17] Scott: Totally. I do want to mention though that there is like definitely good progress and movements like to get around this. So yeah, during grad school I saw more and more labs getting involved in publishing their data.

So it was open access. In our lab, like, along with every paper we have a GitHub repo that contains Jupyter notebooks that can regenerate all of the figures. And there's some scary part of that where it's like, "Oh, my code is out there. Someone could find the bug that I have in my code," because not everything we write is perfect. So it's a little scary, but it's becoming a lot more commonplace, and that's really good to see. So that gives me more hope in academia and research in the future, the more this open access approach gets amplified.

[00:41:20] Ken: And to that point, I think, historically, one of the challenges with academia is that it was sort of this, like, ivory tower type of thing, right? It's like, "Oh, there are these PhDs in this closet somewhere, and they're the only ones qualified to work on this problem." I think now, especially with more of a, like, web 3.0, open source nature of things, it's really beautiful that I can go work on, or try and help and contribute on, one of these projects.

And there's, like, infrastructure in place where people can check it. And if I'm doing good work, great. If I'm not doing good work, you don't have to merge any of it. Right. But there's such cool value in hobbyists or people that just genuinely care about a problem doing good analysis and finding insight.

Like, I would argue that, you know, a room of 200 fairly educated, pretty interesting people working on a problem is better than, like, two PhDs working on it. Right. Assuming you could organize those 200 people, right? Because you'd bring great perspectives, the different types of things.

It's like a pretty beautiful and elegant solution that we're coming to with a lot of the open source problems that we're facing. So I would a hundred percent agree with you. I'm very optimistic about the future of research and information. And hopefully I'll be a part of more of it.

I've been taking a break from doing as many projects, but back on the grind soon. Before I talk to you about Sliced, I do want to get your sentiment about having a PhD in the workplace. You know, you've obviously gone through that process. Do you think it's been helpful on the job? Do you think it's something that is worth pursuing for someone that's interested in data science, but not necessarily research?

What is your general sentiment around going all the way and, you know, pursuing a doctorate?

[00:43:25] Scott: Yeah, I definitely wouldn't encourage it to someone just as a means to progress within data science. I feel like a PhD is overvalued in some sense, where I've heard some friends tell me, like, "Oh, I'm joining this team and everyone else has a PhD. I'm very intimidated by that." Or, like, another friend was talking about how, in hiring someone for a specific position, a PhD is required. I felt like that was a little weird, because I don't feel like the PhD experience is necessarily the only way to be qualified.

I question that a bit, just because a position has, like, a research component. If I was a data analyst for five years, I feel like there's just as many, if not more, really good skills that I would learn in that position compared to a PhD, but I just have a certificate at the end of five years.

That's how I personally think about the skills and knowledge that are useful for becoming a data scientist. Some people may think, like, "Oh, a PhD is especially useful," but I don't think that's necessarily the case.

[00:45:33] Ken: Yeah, I mean, from a lot of the people I've talked to, it seems like there is a little bit of almost unlearning you have to do coming from academia after a long time, in that there are, like, specific systems in place.

There are specific ways that you do things in an educational setting that can be very different from what you do in a more, like, professional, business, private-sector setting. And I feel like a lot of the things that you pick up are really good, right? Like, a PhD signals that someone can dedicate four or five years of their life to something and really focus on a problem and solving it. That's a good thing to signal to employers, but the level of structure in most PhD programs is probably very different. You don't have to deal, for the most part, with Slack messages every five minutes and these different types of non-dedicated work that you see in the workplace.

So I'm interested in your opinion on that too. Did you feel like you had to, like, undo some stuff to really come into your own in a more traditional data science role?

[00:46:38] Scott: A bit, yeah. There's definitely, I think, some senses where I feel like I need to understand the whole stack of what I'm working on.

But in some cases, in a data science role, that's less practical, whereas in grad school I was used to, like, being the data engineer in some sense, making a data pipeline to do this and just being able to be self-sufficient. Whereas I see being a data scientist on some teams kind of requires a lot of passing off of ownership in ways that I wasn't used to.

So for example, I've had friends on teams who, as a data scientist, would request for a machine learning engineer to build a feature for their model. And then they would have to wait for, like, two months for that feature to be available to them. That would, like, immensely frustrate me.

I was like, I really want to have access to that position too, in order to do that. And in the data pipeline sense, I was in a position at a job where I was told, like, "Okay, here is our data pipeline, but you don't have to worry about anything before this." But I had a lot of questions about, like, how was that table generated?

And we ended up having issues in that there were differences in expectation between what the data science team was assuming versus what the upstream infrastructure team was actually producing. So I would say I've worked to have more of a level of comfort, being more comfortable in passing off work than I used to be.

But it's also maybe, like, a consideration in finding the positions that I'm most interested in. I want to make sure I'm in a role where I am doing some of the analytics and have clear visibility into the data process as I'm doing data science modeling work.

I'd be hesitant to take a position if I know in advance that I'm going to be blind in a lot of aspects.

[00:49:31] Ken: I really liked the selectiveness there. I think that that's something really important that goes maybe a level deeper than a lot of people think. A lot of people are like, okay, how much money am I going to make?

What is the company name? What are the perks that I get? They're not thinking enough about the specific role and I've been guilty of that too. Like, I took a job a while ago where like everything lined up, it was a really good opportunity, but I got into the work and I was like, I don't like any of this work that I'm doing, like what, like, this does not cater to my strengths.

This data is not... I should have done more vetting of the types of things that matter for the work rather than the things that matter, like, outside of work and for the enjoyment. And I think that that's a really important thing to look at and to tease out in interviews or in any of these things: what is the job going to be like, how much control am I going to have?

How much help am I going to have?

[00:50:28] Scott: It is really difficult to tease apart though, so I think that's fair. Often I don't really know what I'm getting into until I'm there. And then I guess I'm hoping there's some flexibility in the role.

[00:50:43] Ken: So is that something that you like try and optimize for is like how much control over the position that you might have, or like seemingly so.

[00:50:51] Scott: Yeah, I feel like that'd be really difficult to optimize for. So I'm not sure yet. I feel like I've gotten lucky in my experiences so far, having managers who I mesh well with.

[00:51:09] Ken: Awesome. Well, I think we've talked enough about work. I want to ask about one more topic, and that is Sliced.

So you obviously competed in this last season. I watched a lot of the episodes. It's kind of a weird time for me in Hawaii, so I didn't get to finish them all, but I caught up a lot on the YouTube channel that Nick's been putting out clips on. And I'd love to hear what your experience was with that. You know, what was, like, a big takeaway?

How'd you prepare? Those types of things.

[00:51:39] Scott: Yeah. Yeah. So I was on season one, episode one, and I was getting ready for, I think it was 6:00 PM Pacific time. Got a nice cold brew before Nick and Meg sent us a Kaggle link and said, like, "Okay, start. You have two hours to build a model and make some data visualizations."

And I tried to just have fun with it primarily. And that's one thing that Nick really highlighted: though people can pick up some useful data science concepts and tools by watching other people, a big thing is also just to have fun with it. There are people who dressed up in costumes.

I wasn't that creative. But I've never had that competitive feeling, like racing to analyze a data set, before. So it was a bit stressful. But I was reminding myself through the process, like, "Okay, just think of this as, oh, this is a new data set that I just get to explore for a couple of hours," which is one of the things that I would like to do if I had more time. Just the process of exploring a new data set is pretty fun. It's like a game.

[00:53:25] Ken: That's awesome. So, do you think the time limit was appropriate? What do you think is optimal? Maybe, like, three hours, or was it pretty good?

[00:53:36] Scott: I think two is definitely good. More than that, I would start to get tired. I feel like maybe I have a shorter attention span than most people, but after, like, an hour and a half, I was kind of starting to check out and I'd want to move around a lot.

So the first part I would focus on, like, processing data and then do some modeling. But I think in both episodes I was in, for the last 30 minutes I was like, "Okay, I just want to make some plots right now." There is a data visualization aspect of the judging that motivates us to switch around a bit.

But I feel like I was pretty drained at, like, halfway through and especially at the end, so more time wouldn't be good. But I feel like less time would also be a bit too rushed, so I hope they stick with the two hours for the future seasons.

[00:54:36] Ken: Did you do any preparation beforehand?

How did you practice? How did you get tuned up for the actual competition?

[00:54:43] Scott: Yeah, there was a season zero that they had the datasets released on Kaggle for too, so I went through some of those data sets, trying to discipline myself to do a mock run. That's like, okay, if I just receive this, let me sprint through training a model quick.

So yeah, now that there were, like, 12 episodes in the first season, if people want to have that experience, there's a lot of datasets for them to go through. And down the line, I think I'll end up going through several of those data sets just for fun, to explore.

[00:55:25] Ken: Do you think that that is an effective way to learn, to give yourself time pressure like that? I mean, I would imagine there are some benefits, and, you know, it might not be completely dissimilar to some interview processes.

[00:55:40] Scott: Totally. Yeah, I would say it's a really nice way to learn in that, like, with each new data set I dive into, there's usually some method of processing I need to look up. So I might learn a new tool in pandas or I might learn how to process text in a certain way. So I feel like it's a good exercise in discovering some new tools or freshening up on some things I haven't used in a while.

So maybe not so much the time pressure part, but more just the novelty. Yeah. Repeated novelty of jumping into datasets and also looking at examples of what other people have done. Like, I feel like especially as I first started analyzing data, like the things that helped me most were reading other people's code and discovering the ways they're doing them and the tools they're using that I wasn't familiar with.

So, you know, watching the YouTube videos or finding the notebooks that some of the contestants put up to see the exact functions that they use. Like, I know that some of the contestants were really crazy about an algorithm called CatBoost, which I haven't dove into yet, but that's one thing that seemed to work really well. So I need to try that out at some point.

[00:57:10] Ken: Awesome. Yeah. I completely agree. I think code review is something that I didn't start doing until more recently, like the last couple of years. And I think that that's made me a better thinker, right? Because you can kind of jump into someone else's brain for 30 minutes, 40 minutes, see what they were thinking.

And then you're like, "Oh, why didn't I approach it that way?" You know, that's one of the quickest ways to be able to simulate or try a bunch of different approaches that you might not have traditionally done. And, you know, I really love that methodology. I really like doing that. I think I even made a video on it because I was so excited about it.

But absolutely, I'm glad you had a good experience on Sliced. I'm excited to watch the next season. I told Nick I might even compete in probably season three. So yeah, I'll probably be giving you a ring for any tips and tricks to improve my performance there. Maybe you can distract some of the contestants as well.

So, you know, Scott, I really enjoyed this. I don't have any more questions for you. How can people find you? What's the best way to reach out? And are there any projects you're working on right now that you'd like to share about?

[00:58:20] Scott: Yeah, I think LinkedIn is probably the easiest way to get in touch. From there, there's a link to my website, which has my email address, and I'm definitely happy to talk to anyone who wants to chat about, like, fraud detection or data science in general. Definitely reach out if you want to. As for projects, related to one thing we were talking about, I gave my first crack at board game design, making a board game called P-Hacking, where it's a competitive game between two research labs who are both manipulating the same dataset.

And it's a card based game where basically like I will get to manipulate one data point and then you get to manipulate one data point. And there are certain cards that allow us to like, Oh, make a rounding error or like swap the labels of the datasets. So I'm actually pretty happy with it.

Like, I feel like the mechanism works pretty well in that, at the end of the game, I am trying to get population A to be significantly higher than population B, and you're trying to do the opposite. So it's adversarial: I'm trying to increase some samples, and you're trying to prevent them from getting too high, and we're planning out, like, "Oh, which actions should we use sooner rather than later?"

So yeah, I would encourage people to check that out. I made a YouTube video.
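To make the mechanic concrete, here is a toy sketch of the idea as Scott describes it: two labs take turns changing one data point each, and at the end a significance test decides whether either lab got its result. Everything below is hypothetical. The actual rules, cards, and scoring of P-Hacking are not given in the conversation, so the nudge size, the simple "raise your lowest point" strategy, the label-swap helper, and the exact permutation test are all assumptions chosen to keep the example small.

```python
from itertools import combinations
from statistics import mean


def one_sided_p(a, b):
    """Exact permutation p-value for the claim mean(a) > mean(b)."""
    pooled = a + b
    observed = mean(a) - mean(b)
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        grp_a = [pooled[i] for i in idx]
        grp_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if mean(grp_a) - mean(grp_b) >= observed - 1e-12:
            hits += 1
    return hits / total


def swap_labels(a, b):
    """Hypothetical 'swap the labels' card: the two populations trade names."""
    return b, a


def play(a, b, turns=6, nudge=1.0):
    """Labs alternate turns; each raises the lowest point of its own population.
    Lab 1 wants A higher, Lab 2 wants B higher (an assumed, naive strategy)."""
    a, b = list(a), list(b)
    for turn in range(turns):
        target = a if turn % 2 == 0 else b
        target[target.index(min(target))] += nudge
    return a, b


def winner(a, b, alpha=0.05):
    """Decide the game with the permutation test at level alpha."""
    if one_sided_p(a, b) < alpha:
        return "Lab 1"
    if one_sided_p(b, a) < alpha:
        return "Lab 2"
    return "no significant result"


start = [4, 5, 6, 7, 8]
final_a, final_b = play(start, start)
print(winner(final_a, final_b))
print(winner([10, 11, 12, 13, 14], [1, 2, 3, 4, 5]))
```

When both labs play the same mirrored strategy from the same starting data, the populations stay identical and nobody reaches significance, which hints at why the choice and timing of special cards would matter in a real game.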

[01:00:11] Ken: Shoot that over. I will link it 100%, that sounds so fun. And so is it two players or can you have more than that? Or what are we looking at?

[01:00:20] Scott: Yeah, so I did make a mock game manual. So two players would be most natural, but I think it can be extended to four, where it's like two teams of two.

And then you basically alternate turns. I'm still iterating on, like, the specifics. I feel like I need to wait until I take a sabbatical from one job to focus on it full time. But yeah, there's a website, like open cards dot io, or something. And it's linked on my website, but you can actually go in and play, like, a mock rendition of the game.

So I would love if people tried it out and gave some feedback, or are just, yeah, otherwise interested in P-Hacking.

[01:01:10] Ken: Yeah, absolutely. Remind me, and I'll make sure I link it in the description. I will go check it out. That sounds fascinating to me. I personally quite enjoy board games and card games. So I've been having a lot of fun nights.

I think Nick did a Catan tournament a while ago, and it was before I really knew him well, so I would love to get into the circuit and do some Catan-alytics. So same. Excellent. Well, thank you so much for your time, Scott. I really enjoyed this. I'm super excited for this to come out.

You know, I want to check out the board game. I also want to continue to have some more conversations with you down the road. So thank you again.
