statistical data analysis

Monday, March 5, 2012

How to identify the most influential variable in the data

Introduction:

Many research studies aim to identify the most influential variable for a given dependent variable, and this is especially important when the number of independent variables is large. Models with a large number of independent variables tend to overfit and to be less efficient. So a researcher is often interested in the list of most influential variables, from which some meaningful conclusions about the dependent variable can be drawn.

In the following sections, we briefly explain some of the techniques used to identify the most influential variable in the data.

Influential variable by using the study of partial correlation

In a multiple regression study, the semi-partial correlation coefficient, together with the ordinary correlation coefficient, throws good light on variable importance. The squared semi-partial correlation indicates the unique proportion of variance in the outcome variable explained by the target predictor over and above the other predictors involved in the study. In SPSS, the partial correlation can be obtained directly.
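Outside SPSS, the squared semi-partial correlation can be sketched as the drop in R-square when one predictor is removed from the full model. The Python snippet below is only an illustration under that definition: the function names and the synthetic data are made up for this example and are not taken from any study.

```python
import numpy as np

def r_squared(X, y):
    """R-square of an ordinary least squares fit of y on X (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

def squared_semipartial(X, y):
    """Squared semi-partial correlation of each column of X with y.

    Computed as R-square(full model) minus R-square(model without that column).
    """
    full = r_squared(X, y)
    values = []
    for j in range(X.shape[1]):
        reduced = r_squared(np.delete(X, j, axis=1), y)
        values.append(full - reduced)
    return np.array(values)

# Synthetic data (assumption, for illustration only):
# y depends strongly on the first column, weakly on the second, not on the third.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

sr2 = squared_semipartial(X, y)
print("squared semi-partial correlations:", sr2)
```

The predictor with the largest value is the one whose removal costs the model the most explained variance.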

Using regression coefficients for influential variable:

In simple or multiple linear regression, the size of the beta coefficient for each independent variable gives you the size of the influence that variable has on your predicted variable, and the sign of the coefficient (positive or negative) gives you the direction of the effect. In regression with a single independent variable, the coefficient tells you how much the dependent variable is expected to change (increase if the coefficient is positive, decrease if it is negative) when that independent variable increases by one unit. In regression with multiple independent variables, the coefficient tells you how much the dependent variable is expected to change when that independent variable increases by a single unit, keeping all the other independent variables constant. An important point to keep in mind here is the units of measurement of the variables. It is assumed that all the variables are measured in uniform units; otherwise the raw coefficients are not directly comparable.
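When the units are not uniform, one common workaround is to standardize every variable before fitting, so that each coefficient is expressed in standard deviations. The following Python sketch shows the idea; the variable names (`length_m`, `width_mm`) and the data are hypothetical and chosen only to show two predictors of equal importance measured on very different scales.

```python
import numpy as np

def ols_slopes(X, y):
    """Slopes (intercept dropped) from an ordinary least squares fit of y on X."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]

def standardized_slopes(X, y):
    """Slopes after rescaling every variable to mean 0 and standard deviation 1."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    return ols_slopes(Xz, yz)

# Hypothetical data: both predictors contribute equally to y,
# but one is recorded in metres and the other in millimetres.
rng = np.random.default_rng(1)
n = 500
length_m = rng.normal(0.0, 1.0, size=n)
width_mm = rng.normal(0.0, 1000.0, size=n)
X = np.column_stack([length_m, width_mm])
y = 2.0 * length_m + 0.002 * width_mm + rng.normal(size=n)

raw = ols_slopes(X, y)
std = standardized_slopes(X, y)
print("raw slopes:", raw)
print("standardized slopes:", std)
```

The raw slopes differ by a factor of roughly a thousand purely because of the units, while the standardized slopes come out close to each other.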

Partial R-square Value:

The partial R-square value gives a good idea of how much of the variability in the dependent variable is accounted for by each independent variable. The greater the partial R-square value, the more important the variable is in the current multiple regression study. The SAS system, through stepwise regression, provides the partial R-square value for each independent variable in the following form:

                Number    Partial    Model
Step  Label     Vars In   R-Square   R-Square   C(p)      F Value   Pr > F

1     height    1         0.4873     0.4873     470.186   475.23    <.0001
2     Flow      2         0.0908     0.5781     300.778   107.35    <.0001
3     Speed     3         0.0528     0.6309     203.072    71.23    <.0001
4     Pressure  4         0.0238     0.6546     160.218    34.18    <.0001
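To see where the Partial R-Square and Model R-Square columns come from, here is a simplified forward selection sketched in Python. It is an assumption-laden illustration, not the SAS procedure: it adds the variable that raises the model R-square the most at each step and skips the entry and removal significance tests that a real stepwise run applies. The variable names `x1` to `x4` and the data are made up.

```python
import numpy as np

def r_squared(X, y):
    """R-square of an ordinary least squares fit of y on X (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

def forward_partial_r2(X, y, names):
    """Return rows of (step, label, partial R-square, model R-square)."""
    remaining = list(range(X.shape[1]))
    chosen, rows, model_r2 = [], [], 0.0
    while remaining:
        # Try each remaining variable and keep the one giving the highest model R-square.
        trial = {j: r_squared(X[:, chosen + [j]], y) for j in remaining}
        best = max(trial, key=trial.get)
        rows.append((len(chosen) + 1, names[best], trial[best] - model_r2, trial[best]))
        model_r2 = trial[best]
        chosen.append(best)
        remaining.remove(best)
    return rows

# Made-up data with decreasing effect sizes for x1 .. x4.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 0.8 * X[:, 2] + 0.3 * X[:, 3] + rng.normal(size=400)

rows = forward_partial_r2(X, y, ["x1", "x2", "x3", "x4"])
for step, label, partial, model in rows:
    print(f"{step}  {label:<4} {partial:.4f}  {model:.4f}")
```

As in the SAS listing, the partial R-square values add up to the final model R-square, and the variable entered first carries the largest share.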

Other methods for influential variable:

Many researchers also use several other techniques to determine the most influential variable in the data, depending on the type of study and data availability. Some of them use principal component analysis to find the most important variable in the data. For example, the greatest coefficient in the first principal component hints at the most influential variable among the given set of independent variables.
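A rough Python sketch of the principal component idea follows. It takes the eigenvector of the correlation matrix with the largest eigenvalue and reads off its coefficients. The function name and the data are assumptions for illustration. Keep in mind that principal components are computed from the independent variables alone, so this is only a hint and says nothing directly about the dependent variable.

```python
import numpy as np

def first_component_loadings(X):
    """Coefficients of the first principal component of the correlation matrix.

    Returns the unit-length coefficient vector and the share of total
    variance explained by that component.
    """
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    v = eigvecs[:, -1]
    # The sign of an eigenvector is arbitrary; make the largest coefficient positive.
    if v[np.argmax(np.abs(v))] < 0:
        v = -v
    return v, eigvals[-1] / eigvals.sum()

# Made-up data: the first two variables share a common factor, the third is unrelated.
rng = np.random.default_rng(3)
n = 500
factor = rng.normal(size=n)
X = np.column_stack([
    factor + 0.3 * rng.normal(size=n),
    factor + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
])

loadings, share = first_component_loadings(X)
print("first component coefficients:", loadings)
print("share of variance explained:", share)
```

The variables with the largest absolute coefficients are the ones that dominate the first component.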

In some other cases, one can use graphical techniques such as added variable plots, also called partial regression plots, to get an idea of the most influential variable. But in all of the above cases, the crucial part is the researcher's knowledge of the data and variables, and his interpretation skills. No particular technique is suitable for all types of scenarios, and a good feel for them comes only with practice.

Wednesday, July 13, 2011

Frequently asked questions in common multivariate techniques

Many times, friends have asked me about the questions frequently asked on multivariate techniques at interviews, conferences, etc. In fact these questions have no limited scope, but I have tried a little here, and these are my preferred questions.

Multivariate techniques:

Multiple regression analysis:

What is the difference between multiple regression and multivariate regression?
How to select independent variables into the system?
What are the measures of efficiency?
What is the specific proc in SAS?
Assumptions underlying and the consequences of their violations.
Estimation techniques, advantages and disadvantages

Logistic regression analysis:

Difference between logistic and traditional regression
Assumptions if any?
Estimation method?
Efficiency measures?
Which domains have major applications?
Odds ratio implementation
Tests of goodness of fit

Discriminant analysis:

What is the aim of discriminant analysis?
Methods of constructing discriminant functions
Fisher discriminant function
Issue of multicollinearity here
What is cluster discriminant?
Domain applications

MANOVA:

Tests of MANOVA
Structure of the model in MANOVA
Assumptions of MANOVA
How to use it and read the output in the SAS environment
Difference between MANOVA and multiple regression
The MANOVA statement in PROC GLM

Factor analysis:

What is the difference between PC (principal components) and PAF (principal axis factoring)?
What is a Simple or Clean Factor Structure?
Types of factor analysis
Applications in manufacturing
PROC PRINCOMP, how to improve the performance
Type of conclusions in FA

Multidimensional scaling:

Types of multidimensional scaling
How to decide on what dimensions respondents use when evaluating objects
How many dimensions they may use in a particular situation
Test for the relative importance of each dimension
How the objects are related perceptually

Correspondence analysis:

How to use it in market research?
How to read the parameters?
Any significance tests?
How to apply in SAS

Conjoint analysis:

Advantages in market research
Types of conjoint analysis
Relation with regression and logistic regression
Latest developments
Steps in the design of the studies

Cluster analysis:

Why it comes under multivariate techniques
How to choose the variables for the clustering
What are the types of clustering?
Measures for efficiency of clustering
Reports based on clustering

Canonical correlation:

Why is it more important than the usual correlation?
What is the complexity involved here?
Application area

Structural equation modeling:

Why is it so significant?
Applications in SAS
How to interpret the results

Wednesday, June 22, 2011

Regulatory science: some concepts

Definition of regulatory science:

It is the science dealing with innovative methods and tools to assess the safety, quality and efficacy of FDA-regulated products.

Where the role of the regulatory professional lies:

It begins with the R&D phase
Moves into clinical trials analysis
It extends to pre-market approvals

Thursday, March 3, 2011

Steps in data analysis:

The following are the general steps in data analysis:

1. Requirement analysis

2. Formulation of hypothesis

3. Designing the survey

4. Data collection/data tabulation.

5. Perform the prescribed analysis on sample data.

6. Evaluate the results and carry out the analysis on the full data.

7. Tabulate results and conclusions.

8. Limitations and Assumptions if any.

We will see an explanation of each of them in the next post.