Dr. Julia Silge InteRview

Today I interviewed Dr. Julia Silge, the creator of janeaustenr::, tidytext::, and qualtRics::, and author of Text Mining with R. I’m still recovering from hand surgery, so this time the interview was conducted using a voice-to-text app and email.


1. Why do you use R? I came to R when I was transitioning into data science as a career, after working in academia and ed tech. My programming experience at the time was a real mix of everything from C to awk to HTML but I was not professionally proficient in modern data science tooling. Basically I had the right academic background for a data scientist title but not the right specific technical toolkit. As I worked to update that toolkit, I learned some of both Python and R. Both are fantastic languages with different strengths. The main reason that R really hit the right note for me was the tidyverse and its functional programming flavor. This was about 2015 and I learned R “tidyverse first”; this programming idiom enabled me to become effective faster than basically anything I have experienced over my technical career.

2. Do you consider reproducible research a gold standard or an impossible dream? In my real-world work, the more I have adopted goals and practices around reproducible workflows, the more peaceful and happy I and my coworkers have been. These are truly practices that help everybody involved and reduce pain so I am a believer! On the other hand, I saw a really interesting talk at JSM in 2018 by Victoria Stodden about the limits of reproducibility from a computational standpoint. She emphasized how our tools (yes, even tools like Docker) are deeply imperfect.

3. How did you get the idea of creating janeaustenr:: and its successors in the NLP world? Specifically for janeaustenr, I was interested in learning how to build an R package and I had been using the text of Jane Austen’s novels in some of my first data science blog posts, so putting together a small data package was a good fit. In my opinion, the best ideas for packages often come from a real need an individual user has, who is then motivated to build the very thing they need! This is a big part of how tidytext came to be as well; I spoke in more detail about this with Kelly O’Briant for the rOpenSci blog.

4. What kind of contributions would you like to see from the community to make NLP in R even better? As two examples, I’m really excited about the work that Emil Hvitfeldt is doing right now in textrecipes and that Ken Benoit and collaborators are doing with their quanteda.textmodels package. Machine learning for text in R is an area that is really active right now and the work coming out is so exciting to watch. There are multiple aspects that are great: how focused on the user experience the developers are, the high level of cooperation and openness I see, and how thoughtfully these solutions are being crafted for the text domain.

5. How can data scientists properly help to fight COVID-19? I’ve seen both good and bad analysis and I’m sure that you have a thing or two to say given your science background. Well, the last thing we talked about speaks to this a lot, I think. I am an astrophysicist by training, I have a particular interest in NLP now, and my title is currently software engineer. I am a bit of a generalist. I don’t want to be flippant and only say, “Wash your hands and stay home,” but I do think it’s wise to think through how we can use our existing relationships, roles, and vocations to bring hope and renewal during this pandemic, instead of appropriating someone else’s role or expertise.

6. Can you tell me about a use case of your packages that amazed you? About two weeks ago, I got an email from someone who works in a school district, expressing thanks for the work on the rOpenSci qualtRics package that I maintain. He said he was using the package to help his school district make data-informed decisions about online learning and other tough decisions during this global pandemic. I was quite literally overwhelmed, because this note came during what was a tough week for me during these uncertain times. I was amazed that this open source work I had done was making a difference.

7. What do you think we can do as a community to fight gender wage imbalance in our industry? My former dogmatic economist colleagues from college say that “markets don’t discriminate,” but I don’t buy that. I think we have plenty of evidence that labor markets are not efficient all the time, so I’m not too worried about folks who make such arguments. At the end of last year, I published some modeling based on Stack Overflow data that provides evidence that more experienced women who code earn less for the same work, and that having dependents is associated with lower salary only for women. This fits in with other research addressing why women leave tech at higher rates (inability to advance, lower pay, unfair treatment). The big key to increased fairness and diversity in tech is supporting folks from underrepresented groups as they progress in their careers to more senior levels. The action steps in this piece by Rachel Thomas are excellent.

8. What would you recommend for people from academia who are moving to industry? I am very happy that I work in industry now; I love seeing impact from my work on a short timescale and having flexibility in finding an employer that is a good fit for my values and preferences. The skills and signals of competence in industry are somewhat different from those in academia, but it’s important for folks in academia to know that their skills are valuable. The important step is to build out a professional persona that communicates competence clearly, through signals like building a data science portfolio, networking and giving talks, or open source contributions (instead of signals like publishing papers). It looks like the hiring market may be especially tough in the upcoming months due to global financial conditions, so resilience and commitment will be even more necessary during what can be a challenging transition.