Ep. 18 Kavita Ganesan - Software Engineer turned NLP Data Scientist


Kavita Ganesan is a data scientist with expertise in natural language processing, machine learning, text mining and search. Over the last decade, she’s developed AI and data science solutions for companies like GitHub, 3M Health Information Systems and eBay, as well as a handful of startups. Kavita also serves as a consultant for several companies.

Kavita’s expertise is zeroing in on business goals and coming up with robust solutions to hard data problems that stand the test of time. Her work has involved:

  • Extracting insights from unstructured data
  • Designing and developing customized recommendation systems
  • Sentiment analysis
  • Developing high-accuracy text classifiers
  • Search enhancements
  • Text summarization

Kavita received her Ph.D. in computer science, with a focus on text mining, machine learning and search, from the University of Illinois at Urbana-Champaign. She’s authored over ten first-author papers in top-tier data mining and NLP venues such as WWW, COLING, NAACL, IEEE Big Data and Information Retrieval Journal.

Episode Summary

Kavita Ganesan took the traditional four-year approach to getting a bachelor’s in computer science, but her journey didn’t stop there. After undergrad, she switched back and forth between working and more schooling before ultimately getting her PhD in data science in 2013.

After working as an engineer for some time, Kavita found her passion in natural language processing (NLP). She quickly realized that breaking into the field would require her to be a research scientist, not a software engineer. This is when she went back to school and got her PhD.

In this episode we’ll get into the day to day of a research scientist. “You don’t always have to build up sophisticated models,” Kavita says, but you do “need to have your creative hat on at all times.” And for those who think data science is just about complex math, you’ll be surprised! According to Kavita, she spends only 0.5 percent of her time on math “because the math is already done for you in the tools.”

In this episode we’ll also cover:

  1. What it’s like working as a data analyst in healthcare
  2. The advantages of knowing BOTH software engineering and data science
  3. Tips on learning terminology when joining a new company
  4. What bias is and why it’s important in data science

Key Milestones

Key Quotes

On Data Science

“I realized that I really really want to do NLP. And I realized that you can’t do NLP back then as a software engineer, you need to be a research scientist. So that’s what got me in the PhD program focused on text mining, NLP and search.”

“As I was about to graduate, that’s when this whole data science world was growing, and then I just jumped right in.”

“You don’t really have to always build up sophisticated models. You need to have your creative hat on at all times.”

“One easy way to know what’s in the document is by surfacing topics. That means finding keywords that characterize a document by using concepts from text mining and NLP. This can be as simple as looking for keywords that are frequent in each document and then surfacing that to the user.”
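
To make the idea of surfacing frequent keywords concrete, here is a minimal sketch in Python. It is only an illustration of the frequency-based approach Kavita describes, not her actual implementation. The function name `top_keywords`, the tiny stopword list, and the sample text are all made up for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "for", "with", "on"}

def top_keywords(document, k=3):
    """Return the k most frequent non-stopword tokens in a document."""
    # Lowercase the text and keep runs of letters (and apostrophes) as tokens.
    tokens = re.findall(r"[a-z']+", document.lower())
    # Count every token that is not a stopword.
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    # most_common sorts by count, highest first.
    return [word for word, _ in counts.most_common(k)]

# Made-up sample text, loosely in the style of a clinical note.
doc = (
    "The patient reported chest pain. Chest pain was worse with exertion. "
    "The patient was given aspirin for the pain."
)
keywords = top_keywords(doc)
print(keywords)
```

For this sample the most frequent keyword is "pain", which is the kind of term you would surface to a user as a quick hint about what the document covers.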

“15-20% of my time goes into analyzing the data I’m going to work with and how to process it.”

“I think people with a computer science background have a lot more skills with coming up with demos or web apps than those from other backgrounds like statistics.”

On working in healthcare

“In the healthcare world the focus is very narrow. The vocabulary is limited. I would say even the set of problems are quite limited.”

“The area I worked on is analyzing electronic medical records which are completely unstructured, narrated by doctors.”

The importance of math in data science

“I would say I’m more a consumer of math. I use math maybe like .5 percent of my time because the math is already done for you in the tools.”

“You don’t have to be excellent at math but you have to have a very good intuition of how it works because that’s how you can make your models work really well.”

On biases

“I was analyzing clicks to improve the relevancy of a search system and I noticed that all the clicks are at the tops of the search results. So that introduces a bias.”
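
One simple way to spot this kind of position bias is to look at what share of clicks lands on each rank in the results. The sketch below assumes a click log that has already been reduced to a list of clicked rank positions. The function name `click_share_by_rank` and the numbers are invented for illustration and do not come from the episode.

```python
from collections import Counter

def click_share_by_rank(clicked_ranks):
    """Return the fraction of all clicks that landed on each result rank."""
    counts = Counter(clicked_ranks)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {rank: counts[rank] / total for rank in sorted(counts)}

# Made-up click log: each number is the rank of the result that was clicked.
clicks = [1, 1, 1, 2, 1, 3, 1, 2, 1, 1]
shares = click_share_by_rank(clicks)
print(shares)
```

If most of the share sits on rank 1, as it does in this toy data, the clicks say as much about where a result was shown as about how relevant it was, so they should not be treated as unbiased relevance labels.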

“Because we’re training the machine to think like a human we’re going to have the problem of false positives. And also educate the product side of the company that it’s not going to be perfect if a machine does it.”

In hindsight

“I wish I knew more about startups and venture capitalism and product development. I knew how to do coding when I was 18, but I didn’t know how to make it a product.”