Redistricting with Genetic Algorithms

Welcome to my redistricting experiment. Let’s talk a bit about what I’m trying to do here before diving into the actual results.

What is redistricting?

Every 10 years, the US government is required to perform a census of everyone living in the country. Obviously, this isn’t really the case, but they have mathematical models for estimating how many people were missed. Ultimately, the result is an official (if not real) count of how many people live in particular areas.

So what is this used for? Well, it helps allocate funds for government programs and help governments at all levels understand their constituents. It also helps us handle the distribution (or apportionment) of congressional representatives.

Once the state-level counts are ready, an algorithm is used to assign representatives to each state. Everyone starts with 1, and states are given more if their population warrants it. Currently (meaning, after the 2020 census), 6 states only have 1 representative: AK, ND, SD, WY, VT, and DE. Montana actually escaped the 1-representative group and will now have 2.

States who have gotten more or less representatives need to redraw their districts to accommodate that. States that have not changed the number of representatives also have to redraw districts as their populations have probably shifted over the last decade.

Why are you doing anything with redistricting?

Well, there are plenty of recent court cases around how districts are drawn. A couple states have had their maps thrown out because they were too gerrymandered to be considered constitutional. Gerrymandering happens because many states allow the legislature to draw the map. Therefore, they are effectively drawing maps to ensure they keep their seats. Some states have independent commissions, but they also have to follow rules set by the legislature.

What is a good district map, then?

I’m not going to make a value judgement here. There are lots of criteria that go into making these maps, and my maps may not follow them. Some states try to group people similar concerns (whatever that means). Some states require that an entire county be in the same district if at all possible. There are different tolerances for population differences, and how many people end up in a different district from the one they were in at the last election. The list can go on and on.

There are a couple simple criteria that most (when it doesn’t affect their goals) states aim for: even population and compactness. These are the metrics I am using.

So what did you do?

I have a (fairly rusty) background in genetic algorithms (GA). This is where we allow computers to somewhat randomly search for solutions to problems that are too difficult to find exhaustively. I’ve taken this concept and applied it to building congressional districts. Here’s the general process followed for each state.

1. Get the statewide shapefiles for the block groups and blocks in the state

The census divides up the country into several levels. In the 2020 census, they did some fuzzy assignments of people to census blocks to preserve privacy, since census blocks are really small. Blocks are grouped into block groups. Block groups are also grouped into tracts, which are segments of counties. Most of the work has been done with block groups, but I might use blocks if I have time.

The shapefile contains the physical shape of each region. I take that and convert it into GeoJSON, which is a more friendly format for our purposes.

2. Get the census data

I also get the census data for the state. The census data comes in a pipe-separated (|) file with tons of fields. I am focused on summary level 150 (State-County-Tract-Block Group) and 750 since those corresponds with the shape files. The key pieces of information here are the ID, internal point, and population.

3. Merge together census data and GeoJSON

Next, I merge the two data sources together so that I have entries for each block group with their internal point, population, geometry, and which other block groups are adjacent. The adjacency does require some tweaking as states like CA, NY, FL, and HI are not completely connected.

4. Run the genetic algorithm

Now, I’m ready to run the algorithm. For those who care, here are some details about the options I used with the algorithm.

Generational Solver

I used a generational model with strict selection. Each generation is 1000 individuals. There are 500 survivors at the end of the generation, chosen by tournament (i.e. pick 4 and drop the worst 2). I run until I have 120 generations without an improvement.

Crossover

I use a single cut crossover of the gene. This means that I choose a random point in the gene to cut the two parents. I swap the ends (or beginnings, depending on how you look at it) and create two new genes.

Gene

The gene is a list of integers. Where r is the number of block groups and i is in index into the segment, each value can have a value from 0 to r-i-1 inclusive. Take as many segments as there are districts in the state, and you have your gene.

Here’s how the fitness function works:

Create a sorted list of all the block group IDs
For each value of the gene, remove the block group ID at the index specified by the value and add it to the priority list.
For each district, find the first unclaimed value in the priority list and use that as the starting point.
For each region in the order, find the adjacent district which would make the best fitness and assign it.
Repeat until no changes have been made.

Mutation

Another element of GA is mutation, so that we can try to avoid getting stuck at locally-optimal results. About .001% of all child bits from a crossover are mutated. A mutated bit is given a random (but legal) value. If the population seems to be stagnating, I will double the mutation rate every 10 generations until it hits 50%. It will stay there until I end the run or a new hero is found.

Fitness Function

The fitness function is where the magic happens. After block group assignment is completed using the specified strategies in the Gene section above, I need to determine how good the solution is. The fitness function creates several values.

population score - This is the sum of the absolute difference between each district population and the target population.
density score - In order to get compactness, we look at the density of the smallest districts. A district is small if it is one of the smallest (by area) 50% (rounded down) of the districts OR if it should fit completely into a single county. If the total number of people in the small districts is less than the target population for that number of districts, we pad the value with the state’s average density. The resulting density of all the small districts is compared to another value, which we’ll call the “observed maximum density”. These values (except for Idaho) are the result of running the GA while only trying to optimize the density score (except against the theoretical best density). The score itself is the percentage of the observed maximum density that is represented by this map’s small districts.
shape score - This is a simple edge cut ratio. We take the number of edges between regions inside the district, subtract the minimal number (n - 1), and divide it by the total edges minus the minimal number.
county score - The ratio of counties containing the correct number of districts over all counties.

The density, shape, and county scores are all scaled to be out of a million (1e6) to simplify the precision and use integers (which are faster) as much as possible.

When the fitnesses are compared, the actual comparison involves popScore * -40 + shapeScore * 5 + densityScore + countyScore.

5. Run the genetic algorithm again

Once the GA has run, I actually run it again. This time, the gene is the assignments of each block group to a given district. The first run uses a list ordering gene that can have lots of side effects whenever a single value is changed. This run is more about tweaking the individual block groups. This is run with a much lower mutation rate (1/100) and a larger population (10x). The idea here is to preempt some of the hillclimbing work to come as well as search the solution space without being so single-minded.

6. Run the hill-climbing algorithm

Once I have my best results from the genetic algorithm, I post-process the best one to see if I can make it any better. I do this by trying to change one census block group at a time from its current district to a neighboring district (if it is on the border). I also go one step further and try to get another block group into the district that just lost a block group. This can give me a slightly better result by just moving a couple districts over the line.

Files

For each state and/or redistricting map, there are files associated with them.

Data file

The data file is a zip file containing the GeoJSON for the state as well as the census data for the state. There is an additional file that does some manual assignment of adjacency for block groups or clusters of them that are not physically connected to the rest of the state (i.e. islands).

Processed file

This file is a gzipped JSON file of the processed data. This file ultimately powered the entire GA. It is a list of objects with these fields:

id - the GEOID that correlates the GeoJSON file with the record in the census data block group
summary - the summary level (this is always "150")
coords - the latitude and longitude of the internal point from the census data file
count - how many people live in the block group
shape - the GeoJSON geometry for the block group
adjacentRegions - a map where the keys are block group IDs and the values are the distance between the internal points
outerEdge - boolean indicating whether this block group touches the outer edge of the state

Output file

The output file is a zip file containing a bunch of information about the result.

all-districts.geojson - a GeoJSON of all the districts without their constituent block groups
district.*.geojson - GeoJSON files for each individual district including their block groups identified by the GEOID
fitness.json - the fitness of the result
gene.json - the gene that produced this result (not included in hill climbing as it was possibly created from the result of the gene rather than the gene itself)

What is the Wyoming Rule?

The Wyoming Rule is a proposal to expand Congress, based on the size of the smallest state. Luckily, someone did the calculations for the Wyoming Rule on Wikipedia. It also affords me the opportunity to get Puerto Rico in on this, as if it were included. Washington, DC, is still too small for more than one district.

Any states that don't have a Wyoming Rule layer don't have any change in representatives from the current Congressional apportionment.

What is the Cube Root Rule?

Similar to the Wyoming Rule, the Cube Root Rule is a way to decide the size of congress. There are a lot of modern democracies whose lower chambers are about the cube root of the population. Unfortunately, I haven't found anyone who has done the work on this, so I did the calculations for the cube root option myself.

Any states that don't have a Cube Root Rule layer don't have any changes in representatives from thw Wyoming Rule. Similarly, the Cube Root Rule + DC + PR adds DC and Puerto Rico. This adds 3 more seats. In order to give 7 seats to Puerto Rico and 2 to DC, we need to take 6 away from elsewhere, namely California, Connecticut, Illinois, Nevada, Ohio, and Texas.

Conclusion

That’s it. I’m not going to say these are great redistricting maps. They are quite unbiased, though. I am willing to throw away maps that are just hugely weird or where districts are not REALLY contiguous. Seriously, though, this is just an exercise and not meant to be an endorsement. Let’s start conversations so we can avoid the “Goofy kicking Donald Duck” districts.

Population score
Density score
Shape score
County score