Racial and Ethnic Diversity
On August 4th the Census Bureau announced that the soon-to-be-released 2020 Census data will include a “diversity index”1. They state,
“One of the measures we will use to present the 2020 Census results is the Diversity Index, or DI. This index shows the probability that two people chosen at random will be from different race and ethnic groups.”
Although they never say it by name, from the description it is apparent that the chosen DI is something called “Simpson’s Diversity Index” (more simply, just Simpson).
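As a rough sketch of what that description implies, the probability that two randomly chosen people belong to different groups is one minus the sum of the squared group shares. This assumes sampling with replacement, which makes little difference for large populations. The function name and the counts below are made up for illustration; they are not Census figures.

```python
def simpson_di(counts):
    """Probability that two people drawn at random (with replacement)
    belong to different groups: 1 - sum(p_i ** 2)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical region with three groups (illustrative counts only).
example_counts = [600, 300, 100]
print(round(simpson_di(example_counts), 2))
```

A region made up of a single group scores 0, and the score rises toward 1 as more groups share the population more evenly.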
Race and Ethnicity
Although "ethnicity" could in principle refer to any ethnic origin, in practice most data releases limit it to Hispanic origin. Hispanic is an ethnicity, in the same way that French or Japanese is. It is not a race.
The Census Bureau has these seven racial categories:
White
Black or African American
American Indian and Alaska Native
Asian
Native Hawaiian and Other Pacific Islander
Some Other Race
Two or More Races
A person of Hispanic ancestry can be ANY of those listed races. I will examine race & ethnicity from a reporting perspective in detail in a future post. For now, what is important to understand is the DI will have EIGHT categories.
The eighth is Hispanic. That means, to avoid confusion, it is critical not to say “White” but rather “Non-Hispanic White”. Similarly, Black becomes “Non-Hispanic Black” and so on. These are sometimes written as “White alone, non-Hispanic”, “Black alone, non-Hispanic” etc.
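One way to picture the eight categories is as a small lookup: anyone of Hispanic origin goes into the Hispanic category regardless of race, and everyone else goes into the non-Hispanic version of their race category. This is a minimal sketch; the function name and the exact label strings are my own assumptions, not Census identifiers.

```python
RACE_CATEGORIES = [
    "White",
    "Black or African American",
    "American Indian and Alaska Native",
    "Asian",
    "Native Hawaiian and Other Pacific Islander",
    "Some Other Race",
    "Two or More Races",
]


def di_category(race, is_hispanic):
    """Return one of eight mutually exclusive categories (hypothetical labels)."""
    if race not in RACE_CATEGORIES:
        raise ValueError(f"unknown race category: {race!r}")
    if is_hispanic:
        # Hispanic origin takes priority, so each person is counted once.
        return "Hispanic"
    return f"Non-Hispanic {race}"


# Seven non-Hispanic race categories plus Hispanic.
all_categories = ["Hispanic"] + [f"Non-Hispanic {r}" for r in RACE_CATEGORIES]
print(len(all_categories))
```

Because each person lands in exactly one category, the eight shares add up to 100% of the population.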
Diversity Indices
There is a vast body of work in biology (especially ecology) on diversity and diversity indices. Most discussions involve Simpson and another called “Shannon” (there are others). The math gets complicated. For a good review see this website by Lou Jost.
There are two critical items in measuring diversity:
Richness - the number of species (more is better)
Equitability - how equal in number are the species (5 species at 20% is better than 1 species at 96% and 4 at 1%)
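The equitability example above can be checked numerically. Both communities have the same richness (five species), but the chance that two randomly chosen individuals differ is very different. This is only an illustrative sketch; the helper names are hypothetical.

```python
def richness(shares):
    """Number of groups that are actually present."""
    return sum(1 for s in shares if s > 0)


def chance_of_difference(shares):
    """Probability that two random individuals are from different groups."""
    return 1.0 - sum(s ** 2 for s in shares)


even = [0.20, 0.20, 0.20, 0.20, 0.20]
lopsided = [0.96, 0.01, 0.01, 0.01, 0.01]

for name, shares in [("even", even), ("lopsided", lopsided)]:
    print(name, richness(shares), round(chance_of_difference(shares), 4))
```

The even community comes out at about 0.80 and the lopsided one at about 0.08, even though both contain five species.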
Simpson, as seen in the Census quote above, is a probability measure. Shannon is an uncertainty (or entropy) measure. This means that they don’t really measure “diversity”; they measure probability or entropy. It is true that a region with a bigger Simpson or Shannon number has a “larger diversity index”. But if you want to compare multiple communities, or review a region over time, you must convert these indices to something called the “effective number”.
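For comparison with the probability view, here is a minimal sketch of Shannon entropy, computed as the negative sum of each share times its natural logarithm. The choice of natural log is an assumption on my part (other bases are also used), and the function name is hypothetical.

```python
import math


def shannon_entropy(shares):
    """Shannon entropy H = -sum(p * ln p), skipping empty groups."""
    return -sum(p * math.log(p) for p in shares if p > 0)


print(round(shannon_entropy([0.5, 0.5]), 4))   # two equal groups: ln 2
print(round(shannon_entropy([0.25] * 4), 4))   # four equal groups: ln 4
```

The result is not a probability and has no upper limit of 1, which is part of why it is harder to explain to a general audience.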
The Effective Number
The effective number converts any diversity index into an equivalent species count: the number of equally common species that would produce the same index value. Remember, for equitability, equal numbers is best. Because every region is expressed on that same equal-numbers footing, regions can be compared.
A region with an effective number of 3 truly is twice as diverse as a region with an effective number of 1.5. You can do this math because the effective number is linear. Simpson & Shannon are NOT linear, meaning you cannot compare them this way.
If you want to say Region A is X% more diverse than Region B you simply must use the effective number.
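To make the point concrete, here is a small sketch using the standard conversions described in Lou Jost's work: 1 / (1 - DI) for the probability form of Simpson, and exp(H) for Shannon entropy measured with natural logs. The two example regions (three equal groups and six equal groups) are invented for illustration, and the function names are hypothetical.

```python
import math


def simpson_di(shares):
    """Probability that two random people are from different groups."""
    return 1.0 - sum(p ** 2 for p in shares)


def shannon_entropy(shares):
    """Shannon entropy with natural logs, skipping empty groups."""
    return -sum(p * math.log(p) for p in shares if p > 0)


def effective_from_simpson(di):
    """Effective number for the probability form of Simpson: 1 / (1 - DI)."""
    return 1.0 / (1.0 - di)


def effective_from_shannon(h):
    """Effective number for Shannon entropy (natural log): exp(H)."""
    return math.exp(h)


three_groups = [1 / 3] * 3   # hypothetical region A
six_groups = [1 / 6] * 6     # hypothetical region B

for shares in (three_groups, six_groups):
    di = simpson_di(shares)
    h = shannon_entropy(shares)
    print(round(di, 2), round(h, 2),
          round(effective_from_simpson(di), 2),
          round(effective_from_shannon(h), 2))
```

Simpson moves from about 0.67 to 0.83 and Shannon from about 1.10 to 1.79, so neither raw index doubles. Only the effective numbers (3 and 6) show that the second region is twice as diverse.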
Why Pick One Over the Other?
We don’t know why Census picked Simpson. But we can at least talk about the advantages and disadvantages of each index and perhaps speculate.
Probability - which is what Simpson measures - is a fairly intuitive concept. Saying you have a 10% chance of seeing someone from a different race vs a 70% chance is instantly understandable by most people. National media often use Simpson because of this ease of understanding.
Shannon is far harder to understand. Entropy is not an easy concept. It is a number few will understand beyond “bigger is better”. This makes it not terribly useful for a general audience.
So we can understand why Simpson would be chosen over Shannon.
Perhaps the complexity of understanding the effective number explains why it too was not picked. It does bring up an interesting question, though. Why not produce Simpson, Shannon and their associated effective numbers? That would serve both people who need simplicity and those who need a more complex (and appropriate) method.
This is exactly what I have done for StateBook using raw population estimates data. (This process will be replicated once the 2020 Decennial Census data is released). It allows flexibility. If Simpson is an appropriate measure we have it. Same with Shannon. But if you are trying to compare multiple regions (or conduct a time series) - as is the case with most StateBook users - then you MUST convert to the effective number which we also produce.
To reiterate:
Either DI works if you want to state that one region is more diverse than another
Simpson is your choice if you want to talk about probabilities
Shannon is incredibly useful, but it requires special knowledge to interpret properly
But if you want to quantify differences, if you want to say one region is twice as diverse as another region, you must use the effective number