Psychometrics: Validity
Concerned with what the test measures and how well it does so
- How well is the test measuring the domain of knowledge that it is supposed to measure?
- It is concerned with systematic errors: something specific about the test consistently distorts what is being measured.
- This is unlike reliability, which is concerned with random errors.
Example of Systematic Error
The factor of language within a test.
The test is supposed to measure coordination, but there is a big language component because the person has to read and understand the instructions. Therefore, the systematic error is that language is being tested along with coordination.
Sometimes this happens with educational tests where the professor uses unusually difficult language, thereby reducing the validity of the test.
To improve the validity of a test, you try to reduce the systematic errors.
Note that you can have good reliability and poor validity.
For example, a test can measure something consistently, yet not measure what it is intended to measure.
However, it does no good to have a test that is reliable, but with poor validity.
Conversely, you don’t want a test with good validity and poor reliability.
This type of test would measure a given trait, but could not be counted upon to measure it consistently.
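The difference between the two cases can be illustrated with a small simulation. This is only a sketch under simple assumptions: each observed score is treated as a true score plus a constant systematic error (bias) plus a random error, and the true score, bias, and noise values are all hypothetical.

```python
import random
import statistics

TRUE_SCORE = 50.0  # hypothetical true level of the trait being measured


def simulate_scores(bias, noise_sd, n_trials=200, seed=0):
    """Observed score = true score + systematic error (bias) + random error."""
    rng = random.Random(seed)
    return [TRUE_SCORE + bias + rng.gauss(0, noise_sd) for _ in range(n_trials)]


def summarize(scores):
    """Return (average offset from the true score, spread of the scores)."""
    return statistics.mean(scores) - TRUE_SCORE, statistics.stdev(scores)


# Good reliability, poor validity: little random error, large systematic error.
bias_a, spread_a = summarize(simulate_scores(bias=-10.0, noise_sd=1.0))

# Good validity, poor reliability: no systematic error, large random error.
bias_b, spread_b = summarize(simulate_scores(bias=0.0, noise_sd=10.0))

print(f"reliable, not valid: offset={bias_a:.1f}, spread={spread_a:.1f}")
print(f"valid, not reliable: offset={bias_b:.1f}, spread={spread_b:.1f}")
```

In the first case the scores cluster tightly but around the wrong value; in the second they center on the right value but scatter widely.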
Types of Validity
The types of validity studies done depend on the purpose of the test
The four types of validity are:
- Content
- Criterion-related
  - Concurrent
  - Predictive
- Construct
- Face
Content Validity
This type is mostly concerned with achievement tests
It is used specifically when a test is attempting to measure a defined domain
Content validity indicates whether or not the test items adequately represent that domain.
Content validation begins during the development of a test.
Usually a Table of Specifications is built in order to ensure that the entire domain is represented by the items in a test.
Table of Specifications
Miller Assessment for Preschoolers
- The table of specifications compares subtests or specific items to the behavioral domain being tested.
- The table on the previous slide shows three behavioral domains and four subtests.
- The x’s in the boxes tell you that those domains are being tested.
- In the end, all domains should be represented in the test.
- Another method of content validation is by using experts in the field
- The test is sent out to experts who review the test and the domains to be evaluated.
- This is used in conjunction with the table of specifications.
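A table of specifications can be thought of as a mapping from subtests to the domains they sample. The sketch below checks that every domain is covered by at least one subtest. The subtest and domain names are placeholders, not the actual MAP subtests or domains.

```python
# Hypothetical table of specifications: each subtest lists the domains it tests
# (the equivalent of the x's in the boxes).
table_of_specifications = {
    "subtest_1": {"domain_A"},
    "subtest_2": {"domain_A", "domain_B"},
    "subtest_3": {"domain_C"},
    "subtest_4": {"domain_B", "domain_C"},
}
required_domains = {"domain_A", "domain_B", "domain_C"}


def uncovered_domains(table, domains):
    """Return the domains that no subtest covers."""
    covered = set().union(*table.values())
    return domains - covered


def coverage_counts(table, domains):
    """Count how many subtests touch each domain."""
    return {d: sum(d in tested for tested in table.values()) for d in sorted(domains)}


print(uncovered_domains(table_of_specifications, required_domains))
print(coverage_counts(table_of_specifications, required_domains))
```

An empty set from `uncovered_domains` means all domains are represented. The counts show whether any domain is represented too thinly.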
When is it Appropriate to do Content Validation?
Appropriate for:
Achievement tests
Tests related to occupation: employment, classification, job tasks.
Inappropriate for:
Personality tests
Aptitude tests
(no specific domain of knowledge to be tested)
Criterion-Related Validity
Indicates the effectiveness of a test in predicting an individual’s performance on specific activities.
Performance is checked against a criterion
Criterion: A direct and independent measure of what the test is designed to predict.
Example: for a test of vocational aptitude, the criterion might be job performance
There are two types of criterion-related validity: concurrent and predictive. These two types are differentiated by the time period between the test and the criterion.
Concurrent: Short time period between test and criterion.
Predictive: Long time period between test and criterion.
Concurrent Validity
Example: A test is developed to identify individuals with tactile hypersensitivity.
Using concurrent validity, the goal would be to see how well the test can identify who has tactile hypersensitivity.
Two groups of subjects are needed: one group known to have tactile hypersensitivity, one group known to have normal tactile function.
The test is then given to both groups of people.
If the test has concurrent validity, it will accurately identify those who have tactile hypersensitivity and those who are normal.
Look for a high classification rate: perhaps 90% or so.
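The classification rate in the known-groups approach is simply the fraction of subjects the test places in the correct group. A minimal sketch, using made-up data for 20 hypothetical subjects:

```python
def classification_rate(known_status, test_result):
    """Fraction of subjects whose test result matches their known group."""
    if len(known_status) != len(test_result):
        raise ValueError("both lists must describe the same subjects")
    matches = sum(k == t for k, t in zip(known_status, test_result))
    return matches / len(known_status)


# Hypothetical data: True = tactile hypersensitivity, False = normal tactile function.
known = [True] * 10 + [False] * 10
# The test misses one hypersensitive subject and wrongly flags one normal subject.
flagged = [True] * 9 + [False] + [False] * 9 + [True]

rate = classification_rate(known, flagged)
print(f"classification rate: {rate:.0%}")
```

Here 18 of the 20 subjects are classified correctly.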
Another method of concurrent validity:
One group of subjects takes the newly developed test, and also takes a test that is established in the field.
The results on these two tests are compared
Hopefully the new test will be as accurate as the old test.
Example Using the MAP and DDST
What Does This All Mean?
You can see that about 70% of the kids were classified as normal by both tests.
However, 22% that were classified as normal by the DDST were classified as questionable by the MAP.
3% of the kids who were questionable on the DDST were normal on the MAP
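Percentages like these are the cells of a cross-tabulation of the two tests' classifications. The sketch below shows how such a table could be built. The ten children and their classifications are invented for illustration and are not the MAP/DDST study data.

```python
from collections import Counter


def cross_tabulate(established, new):
    """Count each (established-test result, new-test result) pair."""
    return Counter(zip(established, new))


def percent_agreement(established, new):
    """Percentage of children given the same classification by both tests."""
    table = cross_tabulate(established, new)
    agree = sum(count for (a, b), count in table.items() if a == b)
    return 100.0 * agree / len(established)


# Hypothetical classifications for ten children.
established_test = ["normal"] * 8 + ["questionable"] * 2
new_test = ["normal"] * 6 + ["questionable"] * 2 + ["questionable", "normal"]

table = cross_tabulate(established_test, new_test)
for (old_result, new_result), count in sorted(table.items()):
    print(f"established={old_result:<12} new={new_result:<12} n={count}")
print(f"agreement: {percent_agreement(established_test, new_test):.0f}%")
```

The off-diagonal cells (children the two tests classify differently) are the ones that need follow-up.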
Is the MAP a Better Test Than the DDST?
It appears that the MAP identified an additional 24% of the kids as potentially having problems (22% in the yellow, questionable category and 2% in the red, abnormal category).
However, predictive studies are needed when these kids reach school age--to see if the MAP was accurate.
Predictive Validity
Similar to concurrent validity, but with a longer time period between testing and measurement of criterion.
Need a group of people who can be studied long term. This is called a longitudinal study.
Everyone in the group is given the test and scores are tabulated.
Then you wait a period of time--usually months or years, but it could be shorter depending upon what is being measured.
After a period of time, the criterion is measured. In the example used previously, the question is whether the children in the group go on to develop tactile hypersensitivity.
The earlier test results are then compared with their later hypersensitivity status.
Did the test accurately predict which kids would develop hypersensitivity?
If so, then the test has predictive validity
Predictive validity takes a long time to establish and a test may have predictive validity studies going on for years.
In the meantime, concurrent validity studies help establish that the test does indeed measure a specific criterion.
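One common way to summarize the comparison between earlier test results and later criterion status is with sensitivity and specificity. These terms are not used in the slides and are introduced here as an assumption about how the comparison might be scored. The data are hypothetical.

```python
def predictive_summary(predicted, developed):
    """Compare earlier test predictions with later criterion status.

    predicted[i]: the test said child i would develop the condition.
    developed[i]: child i actually developed it by follow-up.
    Assumes at least one child in each outcome group.
    """
    pairs = list(zip(predicted, developed))
    true_pos = sum(p and d for p, d in pairs)
    false_pos = sum(p and not d for p, d in pairs)
    false_neg = sum(d and not p for p, d in pairs)
    true_neg = sum(not p and not d for p, d in pairs)
    return {
        "sensitivity": true_pos / (true_pos + false_neg),  # cases the test caught
        "specificity": true_neg / (true_neg + false_pos),  # non-cases it cleared
    }


# Hypothetical longitudinal data for ten children.
predicted = [True, True, True, False, False, False, False, False, True, False]
developed = [True, True, False, False, False, False, False, True, True, False]

summary = predictive_summary(predicted, developed)
print(summary)
```

High values on both measures would support the claim that the test predicts who develops hypersensitivity.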
Construct Validity
Construct: an unobservable trait that is known to exist.
Examples: IQ, motivation, self-esteem, motor planning, anxiety
How can we be sure a test measures these constructs if we can’t directly measure them?
Ways to assess construct validity include:
Developmental changes
Correlations with other tests
Factor analysis
Convergent and discriminant validation
Developmental Changes
Employs the idea of age differentiation
Useful for any test that is developmental in nature.
Since abilities increase with age (during childhood), it is logical that test scores will also increase with age
This is one measure of construct validity, but is not conclusive.
In other words, determining construct validity by developmental changes alone is not sufficient.
Other measures of construct validity are also needed.
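Age differentiation can be checked by averaging scores within each age group and confirming that the averages rise with age. A minimal sketch with invented ages and scores:

```python
from statistics import mean


def mean_score_by_age(records):
    """Group (age, score) records by age and average the scores in each group."""
    by_age = {}
    for age, score in records:
        by_age.setdefault(age, []).append(score)
    return {age: mean(scores) for age, scores in sorted(by_age.items())}


def increases_with_age(means_by_age):
    """True if each older age group has a higher mean score than the one before."""
    values = list(means_by_age.values())
    return all(a < b for a, b in zip(values, values[1:]))


# Hypothetical (age in years, raw score) records.
records = [(3, 10), (3, 12), (4, 15), (4, 17), (5, 21), (5, 23)]

means = mean_score_by_age(records)
print(means)
print("scores increase with age:", increases_with_age(means))
```

A rising pattern is consistent with construct validity but, as noted above, does not establish it on its own.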
Correlations with Other Tests
Correlations between the new test and established tests help to establish construct validity.
The idea is that if the correlation is relatively high, the new test measures much the same traits as the established test.
Moderate correlations are preferred to very high ones: if the correlation is very high, you have to wonder whether the new test is really necessary.
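The correlation in question is usually the Pearson correlation coefficient, which is an assumption here since the slides do not name a statistic. A sketch of the calculation, with made-up scores for eight subjects who took both tests:

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length, with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for the same eight subjects on both tests.
new_test_scores = [12, 15, 11, 18, 14, 20, 9, 16]
established_test_scores = [30, 34, 33, 40, 31, 41, 28, 35]

r = pearson_r(new_test_scores, established_test_scores)
print(f"r = {r:.2f}")
```

Values of r near 1 mean the two tests rank subjects almost identically, which is the situation where the new test may add little.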
Factor Analysis
Factor analysis (FA) is a multivariate statistical technique that groups multiple variables into a few factors.
In doing FA you hope to find clusters of variables that can be identified as new factors.
Example: Factor Analysis
In the standardization of the Miller Assessment for Preschoolers (MAP) over 1000 children were tested using that assessment.
The FA study looked at the interrelationships between the various subtests and came up with 6 primary factors.
Factor Analysis of the MAP
Convergent and Discriminant Validation
The idea is that a test should correlate highly with other, similar tests (convergent validation), and
it should have low correlations with tests that are very dissimilar (discriminant validation).
Example
A newly developed test of motor coordination should correlate highly with other tests of motor coordination.
It should also have low correlations with tests that measure attitudes.