Do data scientists build economic models?

What is a data scientist?


After recently completing my PhD program in statistics, I spent the past couple of months looking for work in the field. Almost every company I considered had a job posting with the title "Data Scientist". In fact, it felt like the days of seeing job titles such as Statistical Scientist or Statistician were long gone. Had being a data scientist really replaced being a statistician, or were the two titles simply synonymous, I wondered?

Well, most of the qualifications for the jobs felt like things that would qualify under the title of statistician. Most jobs wanted a PhD in statistics (✓), and most required an understanding of experimental design (✓), linear regression and ANOVA (✓), generalized linear models (✓), and other multivariate methods such as PCA (✓), as well as knowledge of a statistical computing environment such as R or SAS (✓). Sounds like data scientist is really just a code name for statistician.

However, every interview started with the question, "Are you familiar with machine learning algorithms?" More often than not, I found myself trying to answer questions about big data, high-performance computing, and topics such as neural networks, CART, support vector machines, boosted trees, unsupervised models, etc. These are basically statistical questions, but at the end of every interview I couldn't help leaving with the feeling that I knew less and less about what a data scientist is.

I'm a statistician, but am I a data scientist? I work on science problems, so I have to be a scientist! And I work with data too, so I have to be a data scientist! And according to Wikipedia, most academics would agree with me (https://en.wikipedia.org/wiki/Data_science, etc.):

Although the use of the term "data science" has exploded in business, many scientists and journalists see no difference between data science and statistics.

But if I do all of these interviews for a position as a data scientist, why does it feel like they never ask me statistical questions?

Well, after my last interview, I did what any good scientist would do and went looking for data to solve this problem (hey, I'm a data scientist after all). Countless Google searches later, however, I ended up right where I started, feeling that I was once again grappling with the definition of a data scientist. I didn't know exactly what a data scientist was because there are so many definitions (http://blog.udacity.com/2014/11/data-science-job-skills.html, http://www-01.ibm.com/software/data/infosphere/data-scientist/), but apparently everyone was telling me that I wanted to be one.

At the end of the day, I realized that "What is a data scientist?" is a very difficult question to answer. Heck, Amstat devoted two whole months to trying to answer it.

Well, for now I have to be a sexy statistician to be a data scientist, but hopefully the Cross Validated community can shed some light on this and help me understand what it means to be a data scientist. Aren't all statisticians data scientists?


(Edit / update)

I thought this might spice up the conversation. I just received an email from the American Statistical Association about a job posting at Microsoft seeking a data scientist. Here is the link: Data Scientist Position. I think this is interesting because the role touches on many of the specific traits we have been talking about. However, I think many of them require a very rigorous background in statistics, and they also contradict many of the answers posted below. In case the link goes dead, here are the qualities Microsoft seeks in a data scientist:

Basic job requirements and skills:

Business domain experience with analytics

  • You must have experience in various relevant business areas in using critical thinking skills to conceptualize complex business problems and their solutions using advanced analytics in large real-world business data sets
  • The candidate must be able to independently conduct analysis projects and help our internal clients understand the results and translate them into actions that will benefit their business.

Predictive modeling

  • Cross-industry experience in predictive modeling
  • Business problem definition and conceptual modeling with the customer to identify key relationships and define the system scope

Statistics / Econometrics

  • Exploratory data analysis for continuous and categorical data
  • Specification and estimation of structural model equations for business and consumer behavior, production costs, factor demand, discrete choice, and other technology relationships as needed
  • Advanced statistical techniques for analyzing continuous and categorical data
  • Time series analysis and implementation of forecast models
  • Knowledge and experience of working with multiple variable problems
  • Ability to assess model correctness and perform diagnostic tests
  • Ability to interpret statistics or economic models
  • Knowledge and experience in creating discrete event simulations and dynamic simulation models

Data management

  • Familiarity with using T-SQL and analytics for data transformation and applying exploratory data analysis techniques to very large real-world data sets
  • Attention to data integrity, including data redundancy, data accuracy, abnormal or extreme values, data interactions, and missing values.

Communication and collaboration skills

  • Work independently and be able to work with a virtual project team looking for innovative solutions to challenging business problems
  • Collaborate with partners, apply critical thinking skills, and drive analytical projects end-to-end
  • Excellent oral and written communication skills
  • Visualize the analysis results in a form that can be used by a variety of stakeholders

Software packages

  • Advanced statistical / econometric software packages: Python, R, JMP, SAS, Eviews, SAS Enterprise Miner
  • Data exploration, visualization and management: T-SQL, Excel, PowerBI and equivalent tools

Qualifications:

  • At least 5 years of relevant experience is required
  • Postgraduate studies in a quantitative field are desirable.





Reply:


There are a couple of humorous definitions that haven't been given:

Data Scientist: Someone who does statistics on a Mac.

I like this one as it works well with the more-hype-than-substance angle.

Data Scientist: A statistician who lives in San Francisco.

This one riffs in the same way on the West Coast flavor of it all.

Personally, I find the discussion (in general and here) a bit boring and repetitive. When I was thinking about what I wanted to become, maybe a quarter of a century or more ago, I was aiming to be a quantitative analyst. That is still what I do (and love!), and it overlaps with and largely covers what has been given in the various answers here.

(Note: there is an older source for quote two, but I can't find it right now.)







People define data science differently, but I think the common part is:

  • practical knowledge of handling data,
  • practical programming skills.

Contrary to its name, it is rarely "science". That is, in data science the emphasis is on practical results (as in engineering), not on the proofs, mathematical purity, or rigor characteristic of academic science. Things have to work, and it makes little difference whether the solution is based on an academic paper, an existing library, your own code, or an impromptu hack.

A statistician is not necessarily a programmer (they may use pen and paper or dedicated software). Also, some data science job postings have nothing to do with statistics. For example, they may be data engineering, such as processing big data, even if the most advanced math involved is computing an average (though personally I wouldn't call this activity "data science"). In addition, "data science" is hyped, so tangentially related jobs use this title, either to lure applicants or to boost the ego of current workers.

I like the taxonomy from Michael Hochster's answer on Quora:

Type A Data Scientist: The A is for analysis. This type is primarily concerned with making sense of data or working with it in a fairly static way. The Type A data scientist is very similar to a statistician (and may be one), but knows all the practical details of working with data that are not taught in the statistics curriculum: data cleaning, methods for dealing with very large data sets, visualization, deep knowledge of a particular domain, writing well about data, and so on.

Type B Data Scientist: The B is for building. Type B data scientists share some statistical background with Type A, but they are also very strong programmers and may be trained software engineers. The Type B data scientist is mainly interested in using data "in production". They build models that interact with users, often serving recommendations (products, people you may know, ads, movies, search results).

In this sense, a Type A data scientist is a statistician who can program. But even on the quantitative side, there are people with more of a background in computer science (e.g. machine learning) than in regular statistics, or people who focus, for example, on data visualization.

And the data science Venn diagram (here: hacking ~ programming):

See also alternative Venn diagrams (this and that). Or even a humorous tweet with a balanced list of typical data scientist skills and activities:

See also this post: Data Scientist - Statistician, Programmer, Consultant and Visualizer?







There are a number of surveys of the data science field. I like this one because it tries to analyze the profiles of people who actually hold data science jobs. Instead of relying on anecdotal evidence or the author's biases, it uses data science techniques to analyze the DNA of data scientists.

It's pretty insightful to look at the skills listed by data scientists. Note that the top 20 skills include many IT skills.

In today's world, a data scientist is expected to be an all-rounder: a self-learner with a solid quantitative foundation, programming ability, boundless intellectual curiosity, and great communication skills.

UPDATE:

I'm a statistician, but am I a data scientist? I work on science problems so I have to be a scientist!

If you're doing a PhD, you are most likely already a scientist, especially if you have published papers and do active research. However, you don't need to be a scientist to be a data scientist. There are some roles at certain companies, such as Walmart (see below), where a PhD is required, but data scientists typically have BS and MS degrees, as shown in the examples below.

As you can see from the table above, you will most likely need good programming and computing skills. In addition, data science is often associated with a certain, often "deep", competence in machine learning. You can certainly call yourself a data scientist if you have a PhD in statistics. However, computer science PhDs from top schools may be more competitive than statistics graduates, because they can have quite strong applied statistics knowledge complemented by strong programming skills, a combination sought after by employers. To counter them, you need to acquire strong coding skills, so that on balance you are very competitive. What is interesting is that typically all statistics PhD students have some programming experience, but in data science the requirement is often much higher.

For me, the advantage of a PhD in statistics lies in addressing the problem captured by the usually dropped second half of the phrase "jack of all trades": "master of none". It's good to have people who know a little about everything, but I always look for people who also know something in depth; whether that is statistics or computer science is not that important. What matters is that the person is capable of getting to the bottom of things, which is a handy trait when you need it.

The survey also lists the top employers of data scientists. Microsoft is apparently at the very top, which surprised me. If you want a better idea of what they're looking for, it helps to search LinkedIn for "Data Science" in the "Jobs" section. Below are two excerpts from Microsoft's and Walmart's job postings on LinkedIn to make a point.

  • Microsoft, data scientist

    • Over 5 years of software development experience building data processing systems/services
    • Bachelor's or higher qualification in computer science, EE, or mathematics with a specialization in statistics, data mining, or machine learning.
    • Excellent programming skills (C#, Java, Python, etc.) in handling large amounts of data
    • Basic knowledge of Hadoop or any other big data processing technology
    • Knowledge of analytics products (e.g. R, SQL AS, SAS, Mahout, etc.) is an asset.

Note that knowledge of stat packages is only an advantage, but excellent programming skills in Java are required.

  • Walmart, data scientist

    • PhD in Computer Science or a similar field, or MS with at least 2-5 years of related experience
    • Good functional coding skills in C++ or Java (Java is strongly preferred)
    • Must be able to spend up to 10% of the work day writing production code in C++/Java/Hadoop/Hive
    • Expert knowledge of one of the scripting languages such as Python or Perl.
    • Experience in handling large amounts of data and distributed computing tools is an advantage (Map/Reduce, Hadoop, Hive, Spark, etc.)

The doctorate is preferred here, but only the focus on computer science is mentioned. Distributed computing with Hadoop or Spark is likely an uncommon skill for a statistician, but some theoretical physicists and applied mathematicians use similar tools.

UPDATE 2:

"It's time to kill the data scientist title," says Thomas Davenport, who wrote the 2012 article "Data Scientist: The Sexiest Job of the 21st Century" that started the data scientist craze:

What does it mean today to say that you are, or want to be, or want to hire a “data scientist”? Unfortunately not much.







I read this somewhere (EDIT: Josh Wills explains his tweet):

A data scientist is someone who is better at statistics than any programmer and better at programming than any statistician.

This quote can be explained briefly through this data science process. At first glance the schema seems to prompt the question, "Well, where's the programming part?" But when you have tons of data, you need to be able to handle it.







I've written several answers, and every time they got long and I decided I was just getting up on a soapbox. But I think this conversation hasn't fully explored two important factors:

  1. The science in data science. A scientific approach is one where you try to destroy your own models, theories, features, techniques, etc., and only if you fail to do so do you accept that your results may be useful. It's a mindset, and many of the best data scientists I've met have a science background (chemistry, biology, engineering).

  2. Data science is a broad field. A good data science result usually requires a small team of data scientists, each with their own specialty. For example, one team member is more rigorous and statistical, another is a better programmer with an engineering background, and another is a strong consultant with business acumen. All three are quick to learn the domain, and all three are curious and want to find out the truth, however painful it may be, and to do what is in the best interest of the (internal or external) customer, even if the customer doesn't understand.

The fad in recent years - I think it has faded - is to recruit computer scientists who are knowledgeable about cluster technologies (the Hadoop ecosystem etc.) and say this is the ideal data scientist. I think this is what the OP came across, and I would advise the OP to play to their strengths in rigor, correctness, and scientific thinking.







I think Bitwise covers most of my answer, but I'll add my 2c.

No, I'm sorry, but a statistician is not a data scientist, at least based on how most companies define the role today. Note that the definition has changed over time, and one challenge for practitioners is making sure they stay relevant.

I'll give some general reasons why we turn down candidates for "data scientist" roles:

  • Expectations about the scope of the work. Usually the DS needs to be able to work independently. That is, there is no one to create the dataset for them in order to solve the problem assigned to them. So they have to be able to find the data sources, query them, model a solution, and then often also create a prototype that solves the problem. In many cases that is simply a matter of creating a dashboard, an alert, or a live report that is constantly updated.
  • Communication. It seems that many statisticians have difficulty "simplifying" and "selling" their ideas to business people. Can you show just one graph and tell a story from the data in a way that everyone in the room can follow? Note that this is on top of being able to defend every detail of the analysis if challenged.
  • Programming skills. We don't need production-level programming skills, as we have developers for that. However, the candidate must be able to write a prototype and deploy it as a web service on an AWS EC2 instance. So programming skills do not just mean the ability to write R scripts. Somewhere in here I should probably also add fluency in Linux. So the bar is simply higher than most statisticians tend to believe.
  • SQL and databases. No, the candidate can't pick that up on the job, since they will actually need to adapt the basic SQL they already know and learn how to query the various DB systems we use across the organization, including Redshift, Hive, and Presto, each of which uses its own variant of SQL. In addition, learning SQL on the job means the candidate will cause problems for every other analyst until they learn how to write efficient queries.
  • Machine learning. Typically, candidates have used logistic regression or a few other techniques to solve a problem on a given dataset (Kaggle style). However, even though the interview starts from algorithms and methods, it soon moves on to topics such as feature generation (remember, you have to create the dataset; there is nobody to create it for you), maintainability, scalability and performance, and the associated trade-offs. For some context, you can read a relevant paper from Google published at NIPS 2015.
  • Text analysis. Not a must, but some natural language processing experience is good to have. After all, a large part of the data is in text form. As mentioned earlier, there is no one else to do the transformations and clean the text for you so that it can be consumed by an ML or other statistical approach. Also note that today even CS graduates have already done some project that ticks this box.

Of course, for a junior role you can't have all of the above. But how many of these skills can you afford to be missing and pick up on the job?

Finally, the most common reason for rejecting non-statisticians is precisely the lack of even basic statistical knowledge. And somewhere in there lies the difference between a data engineer and a data scientist. Nevertheless, data engineers tend to apply for these roles, because they often believe that "statistics" is just the mean, the variance, and the normal distribution. So we may add some relevant but scary statistical keywords to job descriptions to clarify what we mean by "statistics" and avoid confusion.




Allow me to ignore the hype and buzzwords. I think "data scientist" (or whatever you want to call it) is a real thing, and that it is distinct from a statistician. There are many types of positions that are effectively data scientist roles but are not given that name. One example is people who work in genomics.

In my view, a data scientist is someone who has the skills and expertise to design and study large amounts of complex data (e.g., high-dimensional data where the underlying mechanisms are unknown and complex).

This means:

  • Programming: Being able to implement analyses and pipelines, which often requires a certain degree of parallelization and interfacing with databases and high-performance computing resources.
  • Computer science (algorithms): Designing or choosing efficient algorithms so that the chosen analysis is feasible and the error rate is controlled. Sometimes this may also require knowledge of numerical analysis, optimization, etc.
  • Computer science / statistics (usually with a focus on machine learning): Designing and implementing a framework to ask questions of the data or to find "patterns" in it. This includes not only knowledge of various tests, tools, and algorithms but also designing an appropriate holdout, cross-validation, and so on.
  • Modeling: Often we want to be able to build a model that gives a simpler representation of the data, so that we can make useful predictions and gain insight into the mechanisms underlying the data. Probabilistic models are very popular for this.
  • Domain-specific know-how: An essential aspect of working successfully with complex data is incorporating domain-specific knowledge. So I would say it is crucial that the data scientist either has expertise in the domain, is able to learn new fields quickly, or can interface well with experts in the field who can provide useful knowledge on how to approach the data.
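
To make the point about designing a holdout and a validation scheme a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration, not something taken from the answer above: the function names, the 20% holdout fraction, the five folds, and the toy "predict the training mean" model are all assumptions chosen for brevity.

```python
import random


def train_test_holdout(n_samples, holdout_fraction=0.2, seed=0):
    """Shuffle indices and set aside a final holdout set (hypothetical helper).

    The holdout indices are meant to be touched only once, at the very end.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_holdout = int(round(n_samples * holdout_fraction))
    return indices[n_holdout:], indices[:n_holdout]


def k_fold_splits(indices, k=5):
    """Yield (train, validation) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def mean_model_cv_error(y, indices, k=5):
    """Cross-validated mean squared error of a toy model that predicts the training mean."""
    squared_errors = []
    for train, validation in k_fold_splits(indices, k):
        prediction = sum(y[i] for i in train) / len(train)
        squared_errors.extend((y[i] - prediction) ** 2 for i in validation)
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    # Made-up response values, purely for demonstration.
    y = [float(i % 7) for i in range(100)]
    development, holdout = train_test_holdout(len(y))
    print("Cross-validated MSE on the development set:", mean_model_cv_error(y, development))
    print("Holdout size (kept for a final check):", len(holdout))
```

The design choice worth noting is that the holdout is split off before any cross-validation happens, so model selection on the development folds cannot leak into the final evaluation.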