During a recent session of UNCC’s “Big Data Analytics for Competitive Advantage” we had a general discussion on the state of the field of data science. Here are some of my responses.

1. Based on your investigation, what is data science?

As with all words, descriptive definitions arise from general usage. Much of what I have read and heard (some of which I shared on Twitter, https://twitter.com/praeducer) can be broken down into a few main points:

  • Data science is an interdisciplinary field that brings together computer science, mathematics, and statistics to solve problems in data-generating domains.
  • It augments the decision-making, often with predictive analytics, of domain experts using data.
  • It is generally performed by teams, or rare individuals, that have a variety of skills and knowledge including: systems engineering, software engineering, database administration, machine learning, data visualization, and business analysis.
  • It can be performed with traditional tools such as spreadsheets (e.g. Excel) and relational database systems (e.g. SQL Server or MySQL) but is most often associated with modern scalable cloud computing systems such as SAS and Hadoop. A more interactive exploration of the data in these systems is also common, using visualization tools (like SPSS Modeler and Tableau) or statistical programming languages (like R).

2. Is data science a science?

Being a science across many disciplines, it is not just a “hard science” (like physics or chemistry) where a rigorous execution of the scientific method is necessary. It is also a humanity, where ethics and social impact must be considered. So, yes, it is at least a science. Work in data science should produce testable predictions with the highest degree of accuracy and objectivity possible. Even when integrating more heavily with social sciences (which can still undergo the rigor of any natural science), it should rely heavily on quantifiable data and mathematical models.

It could be described as a meta-science (i.e. science about science). While there may be some generalizations specific to the field of data science, many of the concepts and techniques associated with data science can be applied to any science. Research and application in this field could benefit any other field that can act on observable data. Skills, knowledge, and tools developed in data science are useful to, and shared across, many other sciences.

3. (a) What are your strengths as an aspiring data scientist?

My background is primarily in software engineering and web systems administration. I will excel at any programming or scripting tasks, particularly in object-oriented, procedural, or functional languages. My strongest languages are JavaScript, PHP, and bash scripting since I use them every day at Red Ventures. Academically, I am also experienced in Java and C++ with a basic understanding of Python and Prolog. My understanding of networking and cloud computing is solid. I know how to manage Microsoft and Linux-based server clusters to support web applications (especially content management systems).

The problem domains I know best are digital marketing, the music business, and mental health. I currently work in the Digital Marketing division of Red Ventures, tackling tasks in natural search (SEO), paid search (SEM), and customer experience (web UI/UX optimization). Much of my career to date was spent in the music industry as an audio engineer (both in studio and for live sound) and as a production manager (organizing teams to produce entertaining events). My work at a mental health clinic and my academics in cognitive science also make me well suited for solving human resource problems involving happiness, purpose, and productivity.

3. (b) How would you demonstrate them to the prospective employer?

My love for content marketing gives me a great place to start: my web presence. To visualize my skills, I have code up on GitHub (https://github.com/praeducer) and a portfolio available on Behance (https://www.behance.net/paulprae). To visualize my knowledge, they can read my academic papers at Academia.edu (https://uga.academia.edu/PaulPrae) or my professional blog (http://blog.paulprae.com/).

For a live demonstration, I am comfortable scripting out data processing solutions at any time in languages I am familiar with. I could also architect a basic infrastructure for managing their data and even collecting it if it is web-based data. If the problems we were solving were in marketing, I could lay out the foundation for a digital marketing strategy and emphasize any solutions particularly beneficial to them. If we were in the human resource domain, I could ask the right questions to see how they could boost morale or productivity. If the company were unsure of the emotional state of their workforce, I could develop a strategy and a system for surveying their colleagues.

4. Choose a company to analyze:

Red Ventures is a data-driven company utilizing many proprietary sales and marketing technologies. We are always forward-thinking, looking to aggressively fill any skill gaps and exploit any existing strengths as fast as we can. Our data science team only emerged recently. They are currently focusing on learning and exploration.

4. (a) List the data skills you think are needed:

What may be needed is a combination of gaining new skills and applying existing ones to reach certain end results. The team is potentially missing, or still acquiring:

  • A dedicated software engineer.
  • A central system for advanced predictive analytics.
  • A cohesive data management lifecycle.
  • Confidence in, and understanding of, the best tools to get data science work done.

4. (b) List the skills the company has:

Red Ventures has incredible domain expertise in direct sales and digital marketing. Our business analysts tend to be fairly technical, with a good grasp of the advanced features of Excel and enough SQL skills to make them dangerous. Our engineers are fantastic at building and supporting advanced web applications. Most of our software engineers are full-stack developers. Any engineer is capable of collecting, processing, storing, and presenting data from across our sales and marketing activities. Our IT operations staff is excellent at automating work and scaling out systems. We also have the foundation for enterprise-level business intelligence in place.

5. (a) How many data scientists are there in the US?

This is a really hard question because there is no definitive, collectively agreed-upon definition of a data scientist (e.g. one who performs data science as I described it above). Besides this, there are many potentially synonymous titles or job roles that have heavily overlapping skill and knowledge requirements. Some titles in contention are analyst, statistician, and research scientist. We may be able to apply some natural language processing to job or career sites, like Indeed.com or LinkedIn.com, to figure these points out.

In general, calculating this number would take a few challenging steps. After strictly defining the role of a data scientist, we would have to come up with some measure of how closely other job roles and their titles relate to this definition. Then we would accept other titles that are close enough to our definition of a data scientist. Finally, we would have to find a database of the titles of people living in the US, say using LinkedIn. Assuming this is a sound sample of the US population, we could then take a count of all data scientist titles and increase it proportionately to the total population of the US.
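
The steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the skill lists, the Jaccard overlap used as the closeness measure, the 0.5 threshold, the sample of titles, and the population size are all made-up placeholders, not real LinkedIn data or a method anyone has validated.

```python
# A toy version of the estimation steps. Every title, skill, count, and the
# threshold below is a hypothetical placeholder chosen for illustration.

# Step 1: a strict (here: invented) definition of the data scientist role.
REFERENCE_SKILLS = {"statistics", "machine learning", "programming", "data visualization"}


def closeness(skills):
    """Step 2: Jaccard overlap between a role's skills and the definition."""
    skills = set(skills)
    union = REFERENCE_SKILLS | skills
    return len(REFERENCE_SKILLS & skills) / len(union) if union else 0.0


def accepted_titles(role_skills, threshold):
    """Step 3: keep only the titles that are close enough to the definition."""
    return {title for title, skills in role_skills.items()
            if closeness(skills) >= threshold}


def estimate_population(sample_titles, accepted, population_size):
    """Step 4: count accepted titles in the sample and scale up proportionately.

    This assumes the sample is representative of the whole population.
    """
    matches = sum(1 for title in sample_titles if title in accepted)
    return round(matches * population_size / len(sample_titles))


role_skills = {
    "data scientist": {"statistics", "machine learning", "programming", "data visualization"},
    "statistician": {"statistics", "data visualization", "programming"},
    "research scientist": {"statistics", "machine learning", "domain research"},
    "business analyst": {"data visualization", "spreadsheets"},
}

# A made-up sample of 100 job titles drawn from a made-up population of 1,000.
sample_titles = (["data scientist"] * 2 + ["statistician"] * 1
                 + ["business analyst"] * 5 + ["teacher"] * 92)

accepted = accepted_titles(role_skills, threshold=0.5)
estimate = estimate_population(sample_titles, accepted, population_size=1000)
print(sorted(accepted), estimate)
```

The hard parts in practice are exactly the ones this sketch glosses over: choosing a defensible closeness measure and threshold, and finding a sample that really does represent the US workforce.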

I looked for a study that attempted to run through a process like this. Some of the basic steps were performed by someone at NC State a few years ago for a few related titles (http://analytics.ncsu.edu/?page_id=4025). At that point, in October of 2011, 394 results came back for the exact title “data scientist”. I repeated the same search now, in January of 2015, and I am getting 14,171 results. If I limit my search to the United States, I get 7,872 results. That is a huge increase that could be influenced in part by the increased popularity of data science (so people are switching to that title for personal branding reasons) or by other things such as improved search relevancy by LinkedIn. Either way, there are still a lot of other titles that could potentially be included, plus those individuals who are not publicly searchable on LinkedIn.
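
To put that increase in perspective, here is a quick back-of-the-envelope calculation in Python. The three counts come straight from the searches above; treating the change as steady compound growth over the 39 months between October 2011 and January 2015 is my own simplifying assumption, not something the search results establish.

```python
# Counts reported above for the exact title "data scientist".
results_2011 = 394        # October 2011, all regions
results_2015 = 14_171     # January 2015, all regions
results_2015_us = 7_872   # January 2015, United States only

# October 2011 to January 2015 is 39 months.
months_elapsed = 39

# Overall growth between the two searches.
growth_factor = results_2015 / results_2011

# Assumption: growth compounded steadily, so the yearly factor is the
# overall factor raised to the power (12 / months elapsed).
annual_factor = growth_factor ** (12 / months_elapsed)

# Share of the 2015 results that were in the United States.
us_share = results_2015_us / results_2015

print(f"overall growth: {growth_factor:.1f}x")
print(f"implied yearly growth: {annual_factor:.2f}x")
print(f"US share of 2015 results: {us_share:.1%}")
```

That works out to roughly a 36-fold increase overall, or about a tripling every year if the growth had been steady, with a little over half of the 2015 results located in the United States.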

5. (b) How many job openings are there for data science?

This has many of the same issues as the last question. There may be more data points to work with, though, since companies are aggressively recruiting people in this field (shown through the number of recruiters who message me). It looks like about 2% of all job postings on Indeed.com are related to data science according to the first graph below. I searched for all jobs in the United States without any keywords and found 2,351,687 results. That could mean there are about 47,033 jobs related to data science available. Of course, a lot of this depends on many things like how Indeed.com performs these queries and on where they get this information.
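
The arithmetic behind that estimate is simple enough to write down. In this small Python sketch the total and the 2% share are the figures quoted above; the function name and the alternative shares in the loop are hypothetical, included only to show how sensitive the result is to a percentage read off a graph.

```python
def estimate_openings(total_postings, share):
    """Scale the total number of postings by the data science share.

    The result is truncated to a whole number of postings.
    """
    return int(total_postings * share)


total_postings = 2_351_687   # all US postings returned by the keyword-free search
data_science_share = 0.02    # roughly 2%, read off the trend graph

estimated = estimate_openings(total_postings, data_science_share)
print(estimated)  # 47033

# Hypothetical alternative shares, to show how much the estimate moves
# if the 2% reading is off by half a percentage point in either direction.
for share in (0.015, 0.02, 0.025):
    print(f"{share:.1%}: {estimate_openings(total_postings, share):,}")
```

A small misreading of the graph shifts the answer by more than ten thousand postings, so the figure is best treated as an order-of-magnitude estimate.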

Here are some other example query results:

5. (c) What are the trends?

As you can see above, depending on my queries, the trends vary. The generic job titles appear to be slowly declining overall. I would guess this could be due to people hiring more specific skill sets or due to jobs actually getting filled. The trends tend to be up and to the right for specific big data technologies. Some of these results increase rapidly after 2010, then have a small drop around 2014, and then a spike upwards again. It is interesting that big data technologies and “data scientist” all share a similar curve.

Some of the graphs follow an almost exponential curve. This could almost make sense since there are several analysts plotting an exponential curve in the growth of data (e.g. http://blog.thomsonreuters.com/index.php/big-data-graphic-of-the-day/, and page 2 of https://www.atkearney.com/documents/10192/698536/Big+Data+and+the+Creative+Destruction+of+Todays+Business+Models.pdf/f05aed38-6c26-431d-8500-d75a2c384919). Hopefully, good data science will prevent the need for data scientists to increase proportionately to the growth of data ;P