Data Science Interview Questions

COSO IT Team • February 18th, 2019

Question 1 - Python or R: which one would you prefer for text analytics?

Answer -

Python: Python is widely preferred for Data Science because of its flexibility and the availability of packages that are crucial for Data Science and Machine Learning. As Data Science leans towards machine learning and further into Deep Learning, Python offers the compatibility many practitioners look for. TensorFlow, a popular open source machine learning platform, has Python as its primary interface, which makes the Python ecosystem even stronger for this work. Libraries such as scikit-learn, SciPy, NumPy, pandas and matplotlib help with cleaning, analyzing and visualizing datasets and with training and testing models. Python is not especially fast in raw execution speed, but these libraries run their heavy computation in optimized compiled code, and the language itself reads close to plain English. Its uncluttered syntax lets the programmer focus on the problem, which is a large part of why it has become one of the most favored languages for Data Science and Machine Learning.
R: R is a very popular alternative to Python for data science. Although R is oriented more towards statistical analysis and data visualization than towards deploying machine learning models, it remains one of the most actively used languages, and it offers strong model interpretability and reliable community support. R can have a steeper learning curve, and programmers without prior coding experience may find it difficult to join all the pieces. RStudio, the most common IDE for R, presents four panes that give a detailed picture of what is going on in a session, whereas Python's most popular interactive environment is Jupyter Notebook. R is often seen as more syntax-heavy than Python, but it offers extensive tools for data visualization and is preferred over Python by many who work exclusively in analytics.

For text analytics, Python is the preferred answer because libraries like pandas provide easy-to-use data structures and high-performance data analysis tools.

Question 2 - What is logistic regression? Or state an example of when you have used logistic regression recently.

Answer - Logistic regression models the probability of a binary outcome as a function of a linear combination of predictor variables. For example, to predict whether a candidate will win an election, the outcome is binary: 1 for a win and 0 for a loss. The predictor variables could be the amount of money spent on the candidate's campaign, the time spent campaigning, and so on.
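As a minimal sketch of the election example, the snippet below passes a linear combination of two predictors through the logistic (sigmoid) function to get a win probability. The function name and the coefficient values are hypothetical and chosen only for illustration; in practice the coefficients would be estimated from data.

```python
import math

def predict_win_probability(spend_millions, campaign_weeks, weights=(-4.0, 0.5, 0.1)):
    """Logistic model: probability of a win from two predictors.

    The default weights (intercept, spend, weeks) are made-up values,
    not coefficients fitted to any real election data.
    """
    intercept, w_spend, w_weeks = weights
    # Linear combination of the predictors.
    linear = intercept + w_spend * spend_millions + w_weeks * campaign_weeks
    # The sigmoid squeezes the linear score into the range (0, 1).
    return 1 / (1 + math.exp(-linear))

low_budget = predict_win_probability(spend_millions=1, campaign_weeks=5)
high_budget = predict_win_probability(spend_millions=8, campaign_weeks=30)
print(f"low budget: {low_budget:.3f}, high budget: {high_budget:.3f}")
```

A probability above a chosen cut-off (0.5 is a common default) would be read as a predicted win.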

Question 3 - Differentiate between univariate, bivariate and multivariate analysis.

Answer - These are descriptive statistical analysis techniques that are distinguished by the number of variables analyzed at a given point in time.

Univariate analysis involves only one variable, for example a pie chart of sales by territory.
Bivariate analysis examines the relationship between two variables at a time, for example on a scatter plot.
Multivariate analysis deals with more than two variables and studies the effect of those variables on the response.

Question 4 - What do you understand by the term Normal Distribution?

Answer - Data can be distributed in different ways, with a bias to the left, a bias to the right, or no clear pattern at all. Often, however, the data is distributed around a central value with no bias to the left or right. Such data follows a normal distribution, which has the shape of a symmetric bell curve: values close to the center are the most frequent, and values become less frequent the further they lie from it.
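The symmetry and bell shape can be checked numerically. This small sketch uses `statistics.NormalDist` from the Python standard library; the mean of 170 and standard deviation of 10 are arbitrary illustrative values, and `share_within` is a helper name introduced here.

```python
from statistics import NormalDist

# Hypothetical example: heights in cm with mean 170 and standard deviation 10.
heights = NormalDist(mu=170, sigma=10)

def share_within(dist, k):
    """Fraction of values lying within k standard deviations of the mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)

within_one = share_within(heights, 1)
within_two = share_within(heights, 2)
within_three = share_within(heights, 3)
print(f"{within_one:.4f} {within_two:.4f} {within_three:.4f}")
```

The three shares are roughly 68%, 95% and 99.7%, the usual rule of thumb for a normal distribution.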

Question 5 - What is Collaborative filtering?

Answer - Collaborative filtering is the process used by most recommender systems to find patterns or information by combining viewpoints, multiple data sources and multiple agents. In practice, this means predicting what a user will like from the preferences of other users with similar tastes.

Question 6 - What is the difference between Cluster and Systematic Sampling?

Answer - Cluster sampling is a technique used when it is hard to study a target population spread across a large area and simple random sampling cannot be applied. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.

Systematic sampling is a statistical technique in which elements are selected from an ordered sampling frame at a regular interval. The list is traversed in a circular manner, so when you reach the end of the list you continue again from the top.
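Systematic sampling with circular traversal can be sketched in a few lines. The function name, the random starting position and the choice of interval (frame size divided by sample size) are assumptions made for this illustration.

```python
import random

def systematic_sample(frame, sample_size, seed=None):
    """Pick sample_size elements from an ordered frame at a fixed interval."""
    n = len(frame)
    if not 0 < sample_size <= n:
        raise ValueError("sample_size must be between 1 and len(frame)")
    step = n // sample_size            # sampling interval
    start = random.Random(seed).randrange(n)  # random starting position
    # Take every step-th element, wrapping around to the top of the list.
    return [frame[(start + i * step) % n] for i in range(sample_size)]

frame = list(range(1, 21))             # an ordered frame of 20 unit ids
sample = systematic_sample(frame, sample_size=5, seed=0)
print(sample)
```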

Question 7 - Are expected value and mean value different?

Answer - The two terms do not differ in value, but they are used in different contexts. Expected value is generally used when talking about a random variable, while mean value is used when talking about a probability distribution or a sample population.
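A fair six-sided die makes the relationship concrete. In this sketch (the die, the seed and the number of rolls are illustrative choices), the expected value is computed from the probabilities, and the mean of a simulated sample lands close to it.

```python
from fractions import Fraction
import random
import statistics

faces = [1, 2, 3, 4, 5, 6]

# Expected value of the random variable: each face weighted by its probability 1/6.
expected_value = sum(Fraction(face, 6) for face in faces)

# Mean value of an observed sample of rolls.
rng = random.Random(42)
rolls = [rng.choice(faces) for _ in range(10_000)]
sample_mean = statistics.mean(rolls)

print(expected_value, sample_mean)
```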

Question 8 - What does P-value signify about the statistical data?

Answer - The p-value determines the significance of the results of a hypothesis test. It is a number between 0 and 1 that helps in drawing a conclusion. Using the conventional threshold of 0.05: if the p-value is greater than 0.05, the evidence against the null hypothesis is weak and the null hypothesis cannot be rejected. If the p-value is less than 0.05, the evidence against the null hypothesis is strong and the null hypothesis can be rejected. If the p-value is equal to 0.05, the result is marginal and could go either way.
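As a hedged illustration, the snippet below computes an exact two-sided p-value for a coin-flip experiment, where the null hypothesis is that the coin is fair. The scenario (60 heads in 100 flips) and the function name are made up for this example.

```python
from math import comb

def binomial_two_sided_p(successes, trials):
    """Exact two-sided p-value for the null hypothesis P(success) = 0.5."""
    # Under a fair coin the distribution is symmetric around trials / 2, so
    # "at least as extreme" means at least as far from the center, on either side.
    distance = abs(successes - trials / 2)
    extreme = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - trials / 2) >= distance
    )
    return extreme / 2 ** trials

p_value = binomial_two_sided_p(60, 100)
reject_null = p_value < 0.05
print(f"p-value = {p_value:.4f}, reject null: {reject_null}")
```

Here the p-value comes out slightly above 0.05, so at that threshold the null hypothesis of a fair coin is not rejected.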

Question 9 - Do gradient descent methods always converge to the same point?

Answer - No. Gradient descent methods do not always converge to the same point, because in some cases they settle in a local minimum (a local optimum) rather than the global one. Where they end up depends on the data and on the starting conditions.
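The effect is easy to reproduce on a one-dimensional function with two minima. In this sketch the function f(x) = x^4 - 3x^2 + x, the learning rate and the two starting points are arbitrary choices for illustration.

```python
def gradient(x):
    # Derivative of f(x) = x**4 - 3*x**2 + x, a curve with two separate minima.
    return 4 * x**3 - 6 * x + 1

def gradient_descent(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

left = gradient_descent(-2.0)   # settles in the minimum on the negative side
right = gradient_descent(2.0)   # settles in a different minimum on the positive side
print(f"start -2.0 -> {left:.3f}, start 2.0 -> {right:.3f}")
```

The same algorithm with the same settings ends at two different points, purely because of where it started.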

Question 10 - What is the difference between Supervised Learning and Unsupervised Learning?

Answer - Supervised learning is when an algorithm learns from labeled training data so that the knowledge can be applied to new test data. Classification is an example.

Unsupervised learning is when the algorithm has no labeled examples to learn from beforehand and must find structure in the data on its own. Clustering is an example.

Question 11 - What is the goal of A/B Testing?

Answer - A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest, such as the click-through rate.
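One common way to analyze an A/B test on conversion rates is a two-proportion z-test. The sketch below is only one possible approach, and the visitor and conversion counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 2000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

With these made-up numbers the difference between 10% and 13% conversion is statistically significant at the usual 0.05 level.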

Question 12 - What are an Eigenvalue and Eigenvector?

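Answer - Eigenvectors are the directions along which a linear transformation acts only by stretching, compressing or flipping. Eigenvalues are the strength of the transformation in the direction of the corresponding eigenvectors, that is, the factor by which the stretching or compression occurs.

Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix.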
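As an illustrative sketch, power iteration approximates the dominant eigenvalue and eigenvector of a small symmetric matrix using only the standard library. The 2x2 matrix stands in for a covariance matrix and is made up; in practice a linear algebra library would normally be used instead.

```python
from math import sqrt

def multiply(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def power_iteration(matrix, steps=100):
    """Approximate the largest eigenvalue and its eigenvector."""
    vector = [1.0] + [0.0] * (len(matrix) - 1)   # arbitrary non-zero start
    for _ in range(steps):
        product = multiply(matrix, vector)
        norm = sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]      # keep the vector at unit length
    # Rayleigh quotient: how strongly the matrix stretches along this direction.
    eigenvalue = sum(v * p for v, p in zip(vector, multiply(matrix, vector)))
    return eigenvalue, vector

covariance = [[2.0, 1.0],
              [1.0, 2.0]]
eigenvalue, eigenvector = power_iteration(covariance)
print(eigenvalue, eigenvector)
```

For this matrix the dominant direction is the diagonal (both components equal), and the matrix stretches vectors along it by a factor of 3.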

Question 13 - How can outlier values be treated?

Answer - Outlier values can be identified by using univariate or other graphical analysis methods. If the number of outlier values is small, they can be assessed individually; if it is large, the values can be substituted with either the 99th or the 1st percentile values. The two common ways to treat an outlier are to change the value and bring it within range, or to simply remove it.
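Substituting extreme values with the 1st and 99th percentile values (often called capping or winsorizing) can be sketched as follows. The nearest-rank percentile definition and the sample data are assumptions of this example; other percentile definitions give slightly different cut-offs.

```python
import math

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (pct from 1 to 100)."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]

def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Pull values below/above the chosen percentiles back to those percentiles."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values]

data = list(range(1, 201))   # 200 well-behaved observations...
data[0] = -500               # ...with one extreme low value
data[-1] = 900               # ...and one extreme high value
capped = cap_outliers(data)
print(min(capped), max(capped))
```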

Question 14 - How can you assess a good logistic model?

Answer - There are various ways of assessing the results of a logistic regression model:

Using a classification matrix (confusion matrix) to look at the true positives, true negatives, false positives and false negatives.
Concordance, which helps identify the ability of the model to differentiate between the event happening and not happening.
Lift, which helps assess the model by comparing it with random selection.

Question 15 - What are various steps involved in an analytics project?

Answer - The steps involved in an analytics project are: First, understand the business problem. Then explore the data and become familiar with it. Next, prepare the data for modeling by detecting outliers, treating missing values, transforming variables, and so on. After preparing the data, run the model, analyze the result and adjust the approach. This is an iterative step that is repeated until the best possible outcome is achieved. Then validate the model with a new data set. Finally, implement the model and track the results to analyze its performance over time.

Question 16 - During analysis, how do you treat missing values?

Answer - Missing values are treated after they have been identified. If the analyst observes a pattern in the missing values, he or she should concentrate on it, as it could lead to a meaningful business insight. If no pattern is observed, the missing values can be substituted with the mean or median values, or simply ignored.
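Mean or median substitution is straightforward to sketch. Here missing entries are represented as `None`, which is an assumption of this example; the function name and the data are illustrative.

```python
import statistics

def impute_missing(values, strategy="median"):
    """Replace None entries with the median (default) or mean of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

ages = [23, None, 31, 27, None, 90]
filled_median = impute_missing(ages)
filled_mean = impute_missing(ages, strategy="mean")
print(filled_median)
print(filled_mean)
```

The median is less affected by the extreme value 90 than the mean, which is one reason to prefer it when the data is skewed.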

Question 17 - Explain the box-cox transformation in regression models.

Answer - The response variable in a regression analysis does not always satisfy one or more of the assumptions of ordinary least squares regression. The residuals may curve as the prediction increases, or they may follow a skewed distribution. In such cases, the response variable needs to be transformed so that the data meets the required assumptions. A Box-Cox transformation is a statistical technique for transforming a non-normal dependent variable into an approximately normal shape. Many statistical techniques assume normality, so if the data is not normal, applying a Box-Cox transformation allows a broader number of tests to be run.
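The standard one-parameter form of the transformation is (y^lambda - 1) / lambda for a non-zero lambda, and log(y) when lambda is 0. The sketch below applies it for a given lambda; choosing the best lambda (usually by maximum likelihood) is not shown, and the sample values are made up.

```python
import math

def box_cox(values, lam):
    """Apply the one-parameter Box-Cox transformation to strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]        # limiting case as lambda -> 0
    return [(v ** lam - 1) / lam for v in values]

skewed = [1.0, 2.0, 4.0, 8.0, 16.0, 64.0]            # right-skewed sample
print(box_cox(skewed, 0))      # log transform
print(box_cox(skewed, 0.5))    # square-root-like transform
```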

Question 18 - What is the difference between Bayesian Estimate and Maximum Likelihood Estimation (MLE)?

Answer - In Bayesian estimation, we have some prior knowledge about the problem or the data. Several values of the parameters may explain the data, so instead of a single estimate we obtain multiple candidate models for making predictions, one for each set of parameter values, all sharing the same prior. MLE does not take any prior into consideration, so it is like a Bayesian estimate made with a flat prior.
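A coin-flip example shows the difference in a few lines. This sketch assumes a Beta prior on the probability of heads (a standard textbook choice, not something the answer above specifies); the counts and prior parameters are made up.

```python
from fractions import Fraction

def mle(heads, flips):
    """Maximum likelihood estimate of P(heads): the observed frequency."""
    return Fraction(heads, flips)

def posterior_mean(heads, flips, prior_a, prior_b):
    """Mean of the Beta posterior under a Beta(prior_a, prior_b) prior."""
    return Fraction(heads + prior_a, flips + prior_a + prior_b)

def map_estimate(heads, flips, prior_a, prior_b):
    """Mode of the Beta posterior (assumes both posterior parameters exceed 1)."""
    return Fraction(heads + prior_a - 1, flips + prior_a + prior_b - 2)

heads, flips = 7, 10
print("MLE:", mle(heads, flips))
print("Bayesian mean, Beta(2, 2) prior:", posterior_mean(heads, flips, 2, 2))
print("MAP, flat Beta(1, 1) prior:", map_estimate(heads, flips, 1, 1))
```

With the flat Beta(1, 1) prior the most probable parameter value coincides with the MLE, while an informative prior pulls the estimate towards 0.5.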

Question 19 - How will you define the number of clusters in a clustering algorithm?

Answer - Although the clustering algorithm is not specified, this question is usually asked in reference to K-Means, where K is the number of clusters. The objective of clustering is to group similar elements so that the elements within a group are similar to each other while the groups differ from each other. A common way to choose K is the elbow method: plot the within-cluster sum of squares (WSS) against K and pick the point after which adding more clusters no longer reduces WSS by much.
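One widely used heuristic for choosing K is the elbow method: compute the within-cluster sum of squares (WSS) for several values of K and look for the point where the improvement levels off. The sketch below uses a deliberately tiny one-dimensional K-Means with a deterministic start; the data, the initialization scheme and the function name are all assumptions made for illustration.

```python
def kmeans_1d(points, k, iterations=100):
    """Tiny 1-D k-means; returns (centers, within-cluster sum of squares)."""
    ordered = sorted(points)
    if k == 1:
        centers = [sum(ordered) / len(ordered)]
    else:
        # Deterministic start: spread the initial centers across the sorted data.
        centers = [float(ordered[i * (len(ordered) - 1) // (k - 1)]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        updated = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    wss = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, wss

points = [1, 2, 3, 10, 11, 12, 20, 21, 22]   # three obvious groups
wss_by_k = {k: kmeans_1d(points, k)[1] for k in range(1, 5)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

WSS drops sharply up to K = 3 and barely improves afterwards, so the elbow suggests three clusters for this data.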

Question 20 - Is it possible to perform logistic regression with Microsoft Excel?

Answer - Yes, it is possible to perform logistic regression with Microsoft Excel. There are two ways to do so:

One is to use one of the statistical add-ins that are available from a number of websites.
The other is to work from the fundamentals of logistic regression and use Excel's computational power to build the model yourself.

Question 21 - What is the difference between skewed and uniform distribution?

Answer - A uniform distribution is one in which the observations in a dataset are spread evenly across the range of the distribution. There are no clear peaks in a uniform distribution. A distribution that has more observations on one side of the graph than the other is called a skewed distribution. A distribution with fewer observations on the left (a tail towards lower values) is said to be skewed left, and a distribution with fewer observations on the right (a tail towards higher values) is said to be skewed right.

Question 22 - You created a predictive model of a quantitative outcome variable using multiple regressions. What are the steps you would follow to validate the model?

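Answer - The question is about the post-model-building stage, so we can assume that the null hypotheses have already been tested and that multicollinearity and the standard errors of the coefficients have already been checked. What remains is to check how well the model predicts, for example by examining the residuals and by measuring the prediction error on data that was held out from fitting.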

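Question 23 - Why does L1 regularization cause parameter sparsity whereas L2 regularization does not?

Answer - Regularization, in statistics and machine learning, adds extra information to an optimization problem (a penalty on the model parameters) in order to solve it in a better way. L1 and L2 are the two most commonly used penalties. L1 penalizes the sum of the absolute values of the parameters; its pull towards zero has the same strength no matter how small a parameter already is, so weak parameters are driven to exactly zero, which produces sparsity. L2 penalizes the sum of the squared parameters; its pull weakens in proportion to the parameter, so parameters become small but rarely exactly zero.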
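The contrast between the two penalties shows up already in a one-parameter problem with a closed-form solution. In this sketch, `a` stands for the unregularized estimate and `lam` for the penalty strength; both names and all the numbers are illustrative.

```python
def l1_solution(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w): soft thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0                      # small estimates are set to exactly zero

def l2_solution(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return a / (1 + lam)

unregularized = [3.0, -1.2, 0.4, -0.1]
l1_weights = [l1_solution(a, 0.5) for a in unregularized]
l2_weights = [l2_solution(a, 0.5) for a in unregularized]
print("L1:", l1_weights)
print("L2:", l2_weights)
```

Under L1 the two smallest estimates become exactly zero, whereas under L2 every weight is scaled down but none reaches zero.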

Question 24 - How can you deal with different types of seasonality in time series modeling?

Answer - Seasonality in a time series occurs when the series shows a repeated pattern over time. For example, sales that dip during a holiday season, or air conditioner sales that rise every summer, are examples of seasonality. Seasonality makes a time series non-stationary because the average value of the variable differs between time periods. A common and effective way to remove seasonality from a time series is to difference the series. Seasonal differencing takes the numerical difference between a value and the value one seasonal period earlier.
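Seasonal differencing can be sketched directly from that definition. The quarterly sales figures below are synthetic: a repeating four-quarter pattern plus a steady increase of 2 units per year.

```python
def seasonal_difference(series, period):
    """Subtract from each value the value one seasonal period earlier."""
    return [series[t] - series[t - period] for t in range(period, len(series))]

pattern = [10, 20, 30, 20]                        # repeating quarterly pattern
sales = [pattern[t % 4] + 2 * (t // 4) for t in range(12)]   # three years of data
deseasonalized = seasonal_difference(sales, period=4)
print(sales)
print(deseasonalized)
```

After differencing, the repeating pattern is gone and only the constant year-over-year change remains.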

Question 25 - Can you explain the difference between a Test Set and a Validation Set?

Answer - The test set is used to assess the performance of the final model, that is, to evaluate its generalization and predictive power. The training set is used to fit the parameters, such as the weights. The validation set can be considered part of the training data, because it is used for parameter selection and to avoid overfitting the model being built. The test set, on the other hand, is held back and used only for testing and evaluating the performance of the trained machine learning model.
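A simple three-way split might look like the following sketch. The 60/20/20 proportions, the fixed seed and the function name are illustrative choices rather than a prescribed recipe.

```python
import random

def train_validation_test_split(rows, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the rows once, then cut them into three non-overlapping parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]                              # touched only at the very end
    validation = shuffled[n_test:n_test + n_validation]   # used for parameter selection
    train = shuffled[n_test + n_validation:]              # used to fit the model
    return train, validation, test

train, validation, test = train_validation_test_split(range(100))
print(len(train), len(validation), len(test))
```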

Question 26 - What do you understand by statistical power of sensitivity and how do you calculate it?

Answer - Sensitivity is commonly used to validate the accuracy of a classifier (Logistic, SVM, Random Forest, etc.). Sensitivity is simply the number of correctly predicted true events divided by the total number of actual events. True events here are the events that were actually true and that the model also predicted as true.

Calculating sensitivity is straightforward:

Sensitivity = True Positives / Positives in Actual Dependent Variable

Here, True Positives are the positive events that are correctly classified as positive.
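The formula translates directly into code. In this sketch, labels are encoded as 1 for a positive event and 0 otherwise, and the two label lists are made up.

```python
def sensitivity(actual, predicted):
    """True positives divided by all actual positives (labels are 1 or 0)."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    actual_positives = sum(1 for a in actual if a == 1)
    if actual_positives == 0:
        raise ValueError("sensitivity is undefined without any actual positives")
    return true_positives / actual_positives

actual    = [1, 1, 1, 1, 0, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 1, 0, 0]
score = sensitivity(actual, predicted)
print(score)
```

Here 3 of the 5 actual positives were predicted as positive, so the sensitivity is 3/5.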

Question 27 - What is the importance of selection bias?

Answer - Selection bias occurs when appropriate randomization is not achieved while selecting the individuals, groups or data to be analyzed. It implies that the sample obtained does not represent the population that was intended to be analyzed, so conclusions drawn from it may not hold for that population. Selection bias includes sampling bias, data, time interval and attrition.

The different types of selection bias are:

Time interval - A trial may be terminated early at an extreme value, but the extreme value is most likely to be reached by the variable with the largest variance, even if all the variables have a similar mean.
Data - Specific subsets of the data are chosen to support a conclusion, or bad data is rejected on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
Sampling bias - A systematic error due to a non-random sample of a population, which causes some members of the population to be less likely to be included than others, resulting in a biased sample.
Attrition - A kind of selection bias caused by attrition (loss of participants), that is, discounting trial subjects or tests that did not run to completion.

Question 28 - What is the advantage of performing dimensionality reduction before fitting an SVM?
Answer - The Support Vector Machine (SVM) algorithm often performs better in a reduced space. Performing dimensionality reduction before fitting an SVM is beneficial when the number of features is large compared to the number of observations.

Question 29 - Explain cross-validation.
Answer - Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is to forecast and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a data set to test the model during the training phase (the validation data set), in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.
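K-fold cross-validation is one common form of this idea: the data is cut into k parts, and each part takes a turn as the validation set while the rest is used for training. The sketch below only produces the index splits; the function name is hypothetical and model fitting is left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

In practice the rows are usually shuffled first, and the model's score is averaged over the k validation folds.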

Question 30 - What are the important skills to have in Python with regard to data analysis?
Answer - The important skills to possess for data analysis in Python are:

A good understanding of the built-in data types: lists, tuples, sets and dictionaries.
Mastery of N-dimensional NumPy arrays.
Mastery of pandas DataFrames.
The ability to perform element-wise vector and matrix operations on NumPy arrays. This requires a shift in mindset for someone from a traditional software background who is used to writing for loops.
Knowing how to use the Anaconda distribution and the conda package manager.
The ability to write efficient list comprehensions.
Familiarity with scikit-learn.
Knowing how to profile the performance of a Python script and optimize its bottlenecks.

Question 31 - What are the differences between overfitting and underfitting?
Answer - In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data so that it can make reliable predictions on general, unseen data.

Overfitting - The statistical model describes random error or noise instead of the underlying relationship. It occurs when the model is excessively complex, for example when it has too many parameters relative to the number of observations. An overfitted model has poor predictive performance because it overreacts to minor fluctuations in the training data.
Underfitting - This occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data, for example when fitting a linear model to non-linear data. Such a model also has poor predictive performance.
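The underfitting case is easy to demonstrate: fitting a straight line to data generated by a quadratic. This sketch covers only underfitting (showing overfitting would need noisy data and a more flexible model), and the data points are synthetic.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]                 # the true relationship is non-linear

slope, intercept = fit_line(xs, ys)
line_error = mean_squared_error([slope * x + intercept for x in xs], ys)
print(slope, intercept, line_error)
```

The best possible line is flat and misses the curve entirely, so the error stays large even on the training data itself, which is the signature of underfitting.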

Question 32 - How does data cleaning play a vital role in the analysis?
Answer - Data cleaning plays a vital role in the analysis because:

Cleaning data from multiple sources transforms it into a format that data scientists and analysts can work with.
Data cleaning helps to increase the accuracy of machine learning models.
As the number of data sources increases, the time taken to clean the data grows rapidly, because of both the number of sources and the volume of data they generate.
Cleaning can take up to 80% of the total time spent on an analysis, which makes it a critical part of the task.

Question 33 - What are essential skills and training needed in Data Science?
Answer - The essential skills and training needed in Data Science are: communication and storytelling, optimization and statistical machine learning, computing with big data and the cloud, knowledge of the business domain, the ability to visualize data with the help of tools, and programming skills grounded in the fundamentals of computing.

Question 34 - Explain Data Science Vs Machine Learning.
Answer - Machine learning and statistics are both part of data science. The word learning in machine learning means that the algorithms depend on data, which is used as a training set to fine-tune the parameters of a model or algorithm. Data science is broader and covers various topics, which are listed below:

Data integration.
Distributed architecture.
Automating machine learning.
Data Visualization.
Dashboards and BI.
Data engineering.
Deployment in production mode.
Automated, data-driven decisions.

Question 35 - What does Machine Learning mean for the Future of Data Science?
Answer - Data science includes machine learning. Machine learning is the ability of a machine to generalize knowledge from the data it is given; without data, there is very little a machine can learn. The growth of data science is increasing the importance of machine learning, which is already used in various industries and is spreading to more of them. A machine learning solution is only as good as the data it is given and the ability of its algorithms to make use of that data. Basic machine learning is becoming a standard requirement for data scientists in every field they work in.

Question 36 - What is meant by logistic regression?
Answer - Logistic regression is a method for fitting a regression curve, y = f(x), where y is a categorical variable. It is a classification algorithm used to predict a binary outcome (1/0, Yes/No, True/False) from a given set of independent variables. In other words, it is a regression model in which the response variable is categorical, and what it estimates is the probability of the binary response.

To perform logistic regression in R, we use this command: glm(response ~ explanatory_variables, family = binomial)

Question 37 - What is meant by Poisson regression?
Answer - Poisson regression is used for data that is collected as counts. The response variable is discrete and can only take non-negative integer values. Binomial counts are the number of successes in a fixed number of trials, n, so they are bounded between 0 and n. Poisson counts are the number of occurrences of an event in a certain interval of time or space, and they have no upper bound.

To perform Poisson regression in R, we use the command: glm(response ~ explanatory_variables, family = poisson)
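To make the idea of unbounded counts concrete, this sketch evaluates the Poisson probability of observing k events when the average rate is known. It illustrates the distribution that Poisson regression assumes for the response, not the regression fit itself; the rate of 3 events per interval is made up.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k events in an interval with the given mean rate."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

rate = 3.0   # hypothetical average number of events per interval
probabilities = [poisson_pmf(k, rate) for k in range(50)]
for k in range(6):
    print(k, round(probabilities[k], 4))
```

Every non-negative count has some probability, however small, which is why Poisson counts have no upper bound.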

Question 38 - What are distance measures in R statistics?
Answer - Distance measures cover similarity, dissimilarity and correlation. They are mathematical approaches for measuring the distance between two objects; the computed distance is then used to compare the objects. The comparison can be expressed in three ways:

Similarity - a measure that ranges from 0 to 1.
Dissimilarity - a measure that ranges from 0 to infinity.
Correlation - a measure that ranges from -1 to +1.
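As a small sketch of the first two measures, Euclidean distance serves as a dissimilarity (0 to infinity), and one common convention, assumed here, converts it into a similarity between 0 and 1 via 1 / (1 + distance). The points are arbitrary, and `math.dist` requires Python 3.8 or later.

```python
import math

def dissimilarity(a, b):
    """Euclidean distance: 0 for identical points, unbounded above."""
    return math.dist(a, b)

def similarity(a, b):
    """Maps distance into (0, 1]; 1 means identical. One convention among several."""
    return 1 / (1 + dissimilarity(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(dissimilarity(p, q), similarity(p, q))
print(dissimilarity(p, p), similarity(p, p))
```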

Question 39 - What is a correlation in R?
Answer - Correlation is a technique for investigating the relationship between two quantitative, continuous variables.

Positive correlation - both variables increase or decrease together.
Negative correlation - as one variable increases, the other decreases.

Question 40 - What is Pearson's Correlation Coefficient?
Answer - It is a statistic that gives a number telling us how strong or weak the linear relationship between two objects is. It is not a measure of the distance between them; rather, it describes the bond between the two objects. It is represented by "r" and ranges from -1 to +1.

If r is close to 0, there is little or no linear relationship between the two objects.
If r is positive, then as one of the objects gets larger, the other also gets larger.
If r is negative, then as one of them gets larger, the other gets smaller.
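Pearson's r can be computed from its definition: the covariance of the two variables divided by the product of their standard deviations. The data below is made up so that the three cases are easy to see.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]      # grows with hours
falling = [10, 8, 6, 4, 2]     # shrinks as hours grow
unrelated = [5, 1, 4, 1, 5]    # no linear trend

print(pearson_r(hours, rising), pearson_r(hours, falling), pearson_r(hours, unrelated))
```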
