Data Science Interview Questions with Answers

Gayathri

Data science is an emerging field with applications across e-commerce, healthcare, transport, finance, manufacturing, banking, and more. Because the future scope of data science is immense, many job opportunities with great salaries are available. Aspirants who want to build a career in data science can review the important interview questions and answers below to strengthen their theoretical knowledge.

1. Q: Which one would you choose between Python and R for text analytics, and why?
Ans:
For text analytics, Python has the upper hand over R for the following reasons: Python generally performs faster across all types of text analytics, and its Pandas library offers easy-to-use data structures and high-performance data analysis tools. R, on the other hand, is better suited to statistical modeling and machine learning than to plain text analysis.
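As a small illustration of how little code basic text analysis takes in Python, here is a hypothetical token-frequency sketch using only the standard library (the sample sentence and the `top_tokens` helper are invented for this example; Pandas would take over for larger tabular workloads):

```python
from collections import Counter

def top_tokens(text, k=3):
    """Return the k most common lowercase tokens in a text."""
    tokens = text.lower().split()
    return Counter(tokens).most_common(k)

sample = "data science is fun and data science is growing"
print(top_tokens(sample, k=2))  # [('data', 2), ('science', 2)]
```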

2. Q: Explain the several steps involved in an analytics project.
Ans: 
The steps involved in an analytics project are: 

  • Understanding the business problem 
  • Exploring the data and becoming acquainted with it 
  • Preparing the data for modeling by identifying outlier values, transforming variables, treating missing values, etc.
  • Running the model and analyzing the results, then refining the model (a continuous process until the best results are reached)
  • Validating the model with a new data set
  • Deploying the model and tracking the results to analyze its performance

3. Q: Can you differentiate a validation set from a test set?
Ans:
A validation set is a portion of the training data used for parameter selection and for preventing overfitting of a machine learning model during the development stage. A test set, by contrast, is used to evaluate the performance of the trained machine learning model.
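The three-way split can be sketched in a few lines of plain Python (the 60/20/20 fractions and the helper name are illustrative choices, not a fixed rule):

```python
import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle and partition data into train, validation, and test lists."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```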

4. Q: Explain the role of data cleaning in the data analysis process.
Ans: 
Data cleaning is a challenging exercise because, as the number of data sources grows, the time needed to clean the data increases at a near-exponential rate owing to the huge volumes of data those sources generate. Data cleaning can consume up to 80% of the total time of a data analysis task. Even so, there are strong reasons to clean data before analysis; the two most important are that cleaning data from various sources transforms it into a consistent, workable format, and that it improves the accuracy of a machine learning model.
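A minimal, hypothetical cleaning pass in plain Python: drop rows with missing values and normalize string formatting (the field names and records are invented for illustration):

```python
def clean_records(records):
    """Drop rows with a missing age and normalize name formatting."""
    cleaned = []
    for row in records:
        if row.get("age") is None:
            continue  # missing value: drop the row
        cleaned.append({"name": row["name"].strip().title(),
                        "age": int(row["age"])})
    return cleaned

raw = [{"name": "  alice ", "age": "30"},
       {"name": "Bob", "age": None},
       {"name": "carol", "age": 25}]
print(clean_records(raw))
```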

5. Q: How do you determine the number of clusters in a clustering algorithm?
Ans: 
The main objective of clustering is to group similar entities together, so that entities within a group are alike and the groups differ from one another. The within-cluster sum of squares (WSS) is generally used to measure similarity within a cluster. To determine the number of clusters, WSS is plotted against a range of cluster counts; the resulting graph is known as the Elbow Curve. It contains a point beyond which WSS no longer declines appreciably. This bending point gives the K in K-Means. Although the Elbow method is the most widely used, another significant approach is hierarchical clustering, in which a dendrogram is built first and distinct groups are then identified from it.
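The WSS quantity behind the Elbow Curve can be computed directly; a minimal sketch for 1-D points, with the cluster assignments assumed already given:

```python
def wss(clusters):
    """Within-cluster sum of squares for clusters of 1-D points."""
    total = 0.0
    for points in clusters:
        centroid = sum(points) / len(points)
        total += sum((x - centroid) ** 2 for x in points)
    return total

# Two tight clusters give a low WSS; lumping everything into one
# cluster inflates it. The elbow appears where adding more clusters
# stops reducing WSS appreciably.
print(wss([[1, 2, 3], [10, 11, 12]]))  # 4.0
print(wss([[1, 2, 3, 10, 11, 12]]))   # much larger
```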

6. Q: What do you know about cluster sampling and systematic sampling?
Ans:
Cluster sampling is used when the target population is spread over a broad area, making direct study difficult and simple random sampling ineffective. A cluster sample is a probability sample in which each sampling unit is a cluster, or collection, of elements. In the systematic sampling technique, by contrast, elements are selected at a fixed interval from an ordered sampling frame. The list is treated as circular, so once the end is reached the process continues from the top again. 
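Both techniques can be sketched in plain Python (the population, clusters, and sample sizes below are illustrative):

```python
import random

def systematic_sample(population, k):
    """Take every (n // k)-th element from a fixed starting point."""
    step = len(population) // k
    return [population[i * step] for i in range(k)]

def cluster_sample(clusters, n_clusters, seed=0):
    """Randomly pick whole clusters and keep every element in them."""
    rng = random.Random(seed)
    chosen = rng.sample(clusters, n_clusters)
    return [item for cluster in chosen for item in cluster]

print(systematic_sample(list(range(20)), 5))  # [0, 4, 8, 12, 16]
print(cluster_sample([[1, 2], [3, 4], [5, 6]], 2))
```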

7. Q: Can you differentiate overfitting and underfitting?
Ans: 
To make genuine forecasts on unseen data, machine learning and statistical models must first be fitted to training data. During fitting, errors can occur, and the two most common modeling errors are overfitting and underfitting. A few important comparisons between them:

  • Definition: An overfitted statistical model describes random noise or error instead of the underlying relationship. With underfitting, the machine learning algorithm or statistical model fails to capture the underlying trend of the data at all.
  • Occurrence: Overfitting results when a model is overly complex, for example one with too many parameters relative to the total number of observations. Underfitting occurs when trying to fit a linear model to non-linear data.
  • Poor predictive performance: Both lead to poor predictive performance, but in different ways. An underfit model under-reacts even to large fluctuations in the training data, while an overfitted model overreacts to minor ones.
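A toy demonstration of the contrast: a model that memorizes its training points scores perfectly on them yet fails on new data, while a model that always predicts the mean underfits everywhere (all data points are invented for illustration):

```python
def mse(preds, ys):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

train = [(0, 0.1), (1, 1.9), (2, 4.2), (3, 5.8)]   # roughly y = 2x
test = [(4, 8.1), (5, 9.9)]

# Underfit: ignore x entirely and always predict the mean target.
mean_y = sum(y for _, y in train) / len(train)
underfit = lambda x: mean_y

# Overfit: memorize every training point; predict 0.0 for anything unseen.
table = dict(train)
overfit = lambda x: table.get(x, 0.0)

train_over = mse([overfit(x) for x, _ in train], [y for _, y in train])
test_over = mse([overfit(x) for x, _ in test], [y for _, y in test])
test_under = mse([underfit(x) for x, _ in test], [y for _, y in test])
print(train_over, test_over > test_under)  # 0.0 True: perfect on train, worst on test
```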

8. Q: What do you mean by selection bias? What are its various types? 
Ans:
Selection bias generally arises in research that lacks a random selection of participants. It is an error introduced by the researcher's decisions about whom to study, and it is sometimes called the selection effect. In simple words, selection bias is a distortion of statistical analysis caused by the way the sample was collected. Ignoring selection bias can lead to inaccurate conclusions from the research study. Its various types are as follows: 

  • Sampling bias: A systematic error resulting from a non-random sample of a population, causing some members to be less likely to be included than others, which yields a biased sample.
  • Time interval: A trial may be ended early, for example for ethical reasons, at an extreme value; that extreme is most likely to be reached by the variable with the highest variance, even when all the variables have a similar mean. 
  • Data: Results when specific subsets of data are chosen, or "bad" data rejected, on arbitrary grounds to support a conclusion.
  • Attrition: Caused by loss of participants, i.e. trial subjects dropping out or tests that did not run to completion. 

9. Q: Explain the main aim of A/B testing.
Ans:
A/B testing is a statistical hypothesis test designed for a randomized experiment with two variants, A and B. Its main goal is to increase the likelihood of a favorable outcome by identifying any change to a webpage that improves results. A well-proven method for finding the best promotional and online-marketing approach for a business, A/B testing can be used to test everything from sales emails to search ads and website copy.
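To decide whether variant B's conversion rate genuinely beats A's, a pooled two-proportion z-test is a common choice; a minimal sketch (the visitor and conversion counts are made up):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# A: 200 conversions out of 1000 visitors; B: 260 out of 1000.
z = two_proportion_z(200, 1000, 260, 1000)
print(round(z, 2))  # |z| > 1.96 suggests significance at the 5% level
```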

10. Q: What do you mean by outlier values, and how do you treat them? 
Ans:
Outlier values, or simply outliers, are data points in statistics that do not appear to belong to the population under study. An outlier is an unusual observation that lies far from the other values in the set. Outliers can be spotted using univariate, multivariate, or other graphical analysis methods. A few outliers can be assessed individually, but evaluating a large set of outliers usually means replacing them with the 1st or 99th percentile values. There are two common ways to treat outlier values: 

  • Change the value so that it is brought within a range, or 
  • Simply remove the value (note: not every extreme value is an outlier). 
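The percentile-replacement idea mentioned above can be sketched as a simple winsorizing function (the thresholds and data are illustrative):

```python
def clip_outliers(values, lo_pct=1, hi_pct=99):
    """Winsorize: pull extreme values in to the 1st/99th percentile bounds."""
    s = sorted(values)
    n = len(s)
    lo = s[int(lo_pct / 100 * (n - 1))]
    hi = s[int(hi_pct / 100 * (n - 1))]
    return [min(max(v, lo), hi) for v in values]

data = list(range(1, 100)) + [1000]   # 1000 is an extreme outlier
cleaned = clip_outliers(data)
print(max(cleaned))  # the outlier has been pulled back to the 99th percentile
```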

11. Q: What do you understand by deep learning? 
Ans:
Deep learning is a branch of machine learning that bears a strong analogy to the workings of the human brain. It is a neural-network technique, often built on architectures such as convolutional neural networks. Deep learning has wide applications in social network filtering, medical image analysis, and speech recognition. Although the field was developed long ago, it has gained worldwide popularity only recently, for two main reasons: the huge volumes of data now generated from diverse sources, and the increased availability of the hardware needed to run deep learning models. Keras, Caffe, Microsoft Cognitive Toolkit, Chainer, TensorFlow, and PyTorch are among the most popular deep learning frameworks.

12. Q: What do you mean by linear regression and logistic regression?
Ans:
Linear regression is a statistical technique in which the outcome of a variable Y is forecast from the value of another variable X, called the predictor variable; Y is called the criterion variable. Logistic regression, also known as the logit model, is a statistical technique for predicting a binary outcome from a linear combination of predictor variables. 
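Simple linear regression with one predictor has a closed-form least-squares solution; a minimal sketch (logistic regression would instead pass the same linear combination through a sigmoid to obtain a probability):

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Data lying exactly on y = 2x + 1 recovers those coefficients.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```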

13. Q: Please explain gradient descent.
Ans:
A gradient is defined as the rate of change of a function's output with respect to a change in its input; in a network it measures the change in each weight with respect to the change in error. The term gradient also refers to the slope of the function. Gradient descent is an optimization algorithm that iteratively follows the negative gradient to find a local minimum of a differentiable function.
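The iterative update can be shown on a one-variable function; a minimal sketch minimizing f(x) = (x - 3)^2, whose gradient is 2(x - 3) (the learning rate and step count are arbitrary choices):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient to approach a local minimum."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 6))  # 3.0, the minimizer of (x - 3)^2
```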

14. Q: How do you measure the sensitivity of machine learning models?
Ans: 
In machine learning, sensitivity is used to verify the correctness of a classifier such as random forest, logistic regression, or SVM. Sensitivity is also called the true positive rate (TPR) or recall (REC). It is the ratio of correctly predicted positive events to the total number of actual positive events:
Sensitivity = True Positives / Total Actual Positives. Here the true positives are events that are actually positive and that the model also forecast as positive. The worst sensitivity is 0.0 and the best is 1.0.
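The ratio translates directly into code; a minimal sketch with invented labels:

```python
def sensitivity(y_true, y_pred):
    """True positives divided by all actual positives (recall / TPR)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_positives = sum(y_true)
    return tp / actual_positives

# 3 actual positives; the model finds 2 of them.
print(sensitivity([1, 1, 1, 0, 0], [1, 0, 1, 0, 1]))  # 0.666...
```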

15. Q: Can you list the various differences between supervised and unsupervised learning?
Ans:
Supervised learning is a type of machine learning in which a function is inferred from labeled training data consisting of a set of training examples. Unsupervised learning, on the other hand, draws conclusions from datasets containing input data without labeled outcomes. Other differences between the two types include: 

  • Algorithms used: Supervised learning uses decision trees, neural networks, support vector machines, the k-nearest neighbors algorithm, and regression. Unsupervised learning makes use of anomaly detection, neural networks, latent variable models, and clustering. 
  • Enables: Supervised learning enables regression and classification, whereas unsupervised learning enables density estimation, clustering, and dimensionality reduction.
  • Use: Supervised learning is used for prediction, whereas unsupervised learning is used for analysis.

16. Q: How does backpropagation work? Also, state its various variants.
Ans: 
Backpropagation is a training algorithm for multilayer neural networks. It propagates the error from the output of the network back to every weight inside it, which makes gradient computation efficient. Backpropagation works in the following way: 

  • Forward propagation of the training data through the network 
  • Computation of the error between the output and the target 
  • Calculation of the derivative of the error with respect to the output
  • Propagation of those derivatives backward through the earlier layers
  • Updating of the weights using the computed gradients 

The various variants of backpropagation, distinguished by how much data feeds each update, are: 

  • Batch gradient descent: The gradient is calculated over the whole dataset and one update is performed per iteration.
  • Mini-batch gradient descent: A small batch of samples is used to calculate the gradient and update the parameters.
  • Stochastic gradient descent: A single training example is used to calculate the gradient and update the parameters.
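The batch-versus-stochastic distinction can be shown on a one-weight model y = w * x; a minimal sketch (the data and hyperparameters are illustrative):

```python
import random

def batch_gd(xs, ys, lr=0.01, epochs=200):
    """Batch: one update per epoch, gradient averaged over the whole dataset."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

def stochastic_gd(xs, ys, lr=0.01, epochs=200, seed=0):
    """Stochastic: one update per training example, in shuffled order."""
    rng = random.Random(seed)
    w = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            grad = 2 * (w * xs[i] - ys[i]) * xs[i]
            w -= lr * grad
    return w

xs, ys = [1, 2, 3, 4], [3, 6, 9, 12]   # noiseless y = 3x
print(round(batch_gd(xs, ys), 3), round(stochastic_gd(xs, ys), 3))  # 3.0 3.0
```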

17. Q: Can you explain recommender systems, with one application?
Ans:
Recommender systems are a subcategory of information-filtering systems, mainly used for forecasting the ratings or preferences a customer would give to a product. One application is the product-recommendations section on Amazon, which lists items based on the customer's search history and previous orders. 
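At their core, many recommenders score item-to-item similarity; a hypothetical sketch using cosine similarity over invented rating vectors (one entry per user, 0 meaning unrated):

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

ratings = {"camera": [5, 4, 0, 1],
           "lens":   [4, 5, 1, 0],
           "novel":  [0, 1, 5, 4]}

def most_similar(item):
    """Recommend the item whose rating pattern best matches `item`."""
    others = [(name, cosine(ratings[item], vec))
              for name, vec in ratings.items() if name != item]
    return max(others, key=lambda pair: pair[1])[0]

print(most_similar("camera"))  # lens
```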

18. Q: Can you explain the concept of a Boltzmann machine?
Ans:
A Boltzmann machine uses a simple learning algorithm to discover interesting features that represent complex regularities in the training data. It is normally used to optimize the weights and quantities for a given problem. The simple learning algorithm becomes slow, however, in networks with many layers of feature detectors.

19. Q: Explain eigenvalues and eigenvectors.
Ans:
Eigenvectors are the special directions along which a linear transformation acts only by stretching, compressing, or flipping; they help us understand linear transformations and, in data analysis, are typically computed for a covariance or correlation matrix. Eigenvalues are the corresponding scalars that give the factor by which the transformation stretches or compresses along each eigenvector.
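For a 2x2 symmetric matrix the eigenvalues follow directly from the characteristic polynomial; a minimal sketch (the example matrix is illustrative):

```python
import math

def eig_2x2_symmetric(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]]."""
    trace = a + d
    disc = math.sqrt((a - d) ** 2 + 4 * b * b)
    return (trace + disc) / 2, (trace - disc) / 2

# [[2, 1], [1, 2]] has eigenvalues 3 and 1; e.g. it maps the
# eigenvector [1, 1] to [3, 3] = 3 * [1, 1].
lam1, lam2 = eig_2x2_symmetric(2, 1, 2)
print(lam1, lam2)  # 3.0 1.0
```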

20. Q: What do you know about autoencoders? 
Ans:
Autoencoders are simple unsupervised neural networks mainly used to transform inputs into outputs with as little noise as possible, so that the resulting outputs are very close to the input values. Between the input and output sit one or more hidden layers, each typically smaller than the input layer. An autoencoder receives unlabeled input, encodes it into a compressed representation, and then decodes that representation to rebuild the output.  

That completes this list of 20 important data science interview questions and answers. We hope you found the information useful for preparing well for interviews and building a successful career in data science. 

Wish you good luck!
