BioGeoSim, Acquired Intelligence
Acquired Intelligence, Inc., 1987-2007, updated October 16, 2007
BioGeoSim Main Help

BioGeoSim Help & Tutorial


  1. Introduction to BioGeoSim
  2. Data Format
  3. Defaults
  4. Options
  5. Output
  6. Caveats
  7. BioGeoSim Tutorial

1. Introduction to BioGeoSim

BioGeoSim was designed to simulate the stochastic origin and spread of species geographic ranges in a heterogeneous landscape (Rahbek et al. 2007). The landscape is represented as a gridded domain map, with the rows and columns representing the geographic coordinates of the domain. Entries of 0 indicate grid cells that are not part of the domain (e.g., ocean or water grid cells for a terrestrial domain). Positive real-number entries represent grid cell values of environmental variables (e.g., temperature or NPP), measured for each grid cell in the domain. Each species geographic range is mapped in a separate gridded file, with 1s and 0s corresponding to presence or absence of a species at a grid cell in the domain.

BioGeoSim uses these data to simulate the stochastic placement of each species within the domain. In simple models, all grid cells are equiprobable. In more complex models, the probability of occurrence in a grid cell is proportional to the environmental variable in that grid cell. Geographic ranges may spread through strict range cohesion, complete scatter of occurrences, or a Poisson dispersal model, in which each subsequent cell is colonized by dispersal from an occupied cell.

Once all species have been distributed in the model, BioGeoSim calculates the observed species richness in each cell. After a given number of stochastic replications, BioGeoSim calculates the average species richness per grid cell. These expected values are then used in an additional set of stochastic iterations to calculate a goodness-of-fit deviation and compare it to the deviations from the observed data (i.e., a classic null model test).

Finally, BioGeoSim calculates a simple linear regression statistic of observed versus predicted species richness for all grid cells. If the model fits the data perfectly, the fitted regression line would have a slope of 1.0 and an r2 value of 1.0.

[ Top of Page ]

2. Data Format

The format for the data is somewhat complex, and there are several files that are involved.

First, BioGeoSim requires an 'Area Map.' This is a rectangular matrix, with an entry in every cell. A 0 indicates a grid cell that is not in the domain (such as an aquatic grid cell for a terrestrial domain). Thus, although the grid is rectangular, the domain can take any shape. Each cell in the domain is filled with either a 1 (if there is no further information) or a non-negative real number, indicating the measured value of an environmental variable (such as NPP, or temperature) or some composite index of environmental suitability (such as a PCA score derived from several environmental variables). Negative numbers and missing values are not allowed.

Second, there is a set of 'Species Maps.' A separate map is required for each species, and this map must have the same dimensions as the 'Area Map.' For each species, the map indicates the presence (1) or absence (0) of a species in a grid cell. The occurrence of each species must be in one of the non-zero grid cells of the Area Map. BioGeoSim will abort the analysis if any of the species maps are inconsistent with the area map.

The species maps must each contain a unique name FileName.txt, and they all must be placed in a single folder that does not contain any other files. The Area Map can be stored anywhere else on the computer, but not in the same folder with the species maps.

[ Top of Page ]

3. Defaults

Because BioGeoSim can take a long time to run with large data sets, we suggest you set the default values to something artifically low: 30 replications (for creating the map of expected species richness) and 10 iterations (for the null model test of goodness of fit between the observed richness per grid cell and the expected richness that is generated by the 30 initial replications). Once you are satisfied that the program is working properly, we suggest 300-500 or more replications, and 1000 iterations (if you want to use the results of the null model test).

The default values are for equiprobable colonization and spread into adjacent grid cells. Strict range cohesion is maintained, and the range expands by randomly filling one of the four possible adjacent grid cells, as in a rook's move in chess. This is a pure 'spreading dye' model (Jetz and Rahbek 2001), which, for most data sets, will generate a mid-domain peak of species richness near the center of the domain. The default setting is to use the entire domain space (all non-zero cells in the Area Map) for the stochastic placement and spread of geographic ranges.

[ Top of Page ]

4. Options

Load Area Map

This button opens an open file dialog, allowing you to select the text file that contains the area map. Once this file is open, you can inspect the map on the Area Map tab.

Create Species Richness Maps

This button opens an open folder dialog, allowing you to select the folder containing all of the species' range maps. Once these files are opened and processed, you can inspect the list of species on the Species List tab. After each species name is the number of grid cells occupied in the domain. The Species Richness tab shows the map of tabulated species richness, which is based on the individual species' maps.

Number of Replications

This number determines the number of model replications that will be used to estimate the average species number per grid cell.


There are four options for specifying the domain of the simulation: 1) Full Domain: species can be potentially placed in any non-zero grid cell of the Area Map; 2) Occupied Cells: species can potentially placed only in non-zero grid cells that contain at least 1 species in the observed species richness map; 3) Latitude: species can be placed in any non-zero grid cell that is within the range of the maximum and minum latitude (= grid row) of observed species occurrences; 4) Longitude: species can be placed in any non-zero grid cell that is within the range of the maximum and minimum longitude (= grid column) of observed species occurrences.

Seeding Algorithm (first cell)

For each species' geographic range, the first cell colonized can be chosen either equiprobably, or the probability of occurrence can be scaled proportional to the measured map value in the Area Map.

Colonization Algorithm

This algorithm controls the spread of each species' geographic range after the first cell has been established. Ranges can either spread equiprobably into unoccupied cells, or they can spread with probabilities proportional to the measured map value in the Area Map.

Dispersal Algorithm

There are three choices for the algorithm by which each species' geographic range expands. With Range Scatter, the occurrences are spread randomly and independently within the chosen domain, and there is no restriction on range cohesion. With Contiguous Dispersal, ranges can only spread into empty cells that are adjacent to currently occupied cells by the species. With this option, there is a further choice of 4-cell (rook's move) or 8-cell (queen's move) choices. Both options give similar results, but the 8-cell option is usually faster because there are more choices for range expansion at each step. The final algorithm is a Stepping Stone Model. In this algorithn, dispersal occurs at a random distance and orientation from one of the occupied cells. With the Stepping Stone Model, the user must specify a Poisson parameter, a non-negative real number that determines the mean dispersal distance (units = number of grid cells). If this Poisson distance = 1.0, the pattern will be very similar to that produced by Contiguous Dispersal, although with occasional range disjunctions. As the Poisson parameter is increased, the geographic range becomes more fragmented. Finally, at large values of the Poisson parameter, the pattern will resemble that created by the Range Scatter algorithm, with no restrictions on range cohesion.

With the Stepping Stone Model, some provision has to be made for cases in which the next step lands in a cell beyond the edge of the domain. With the Lost Colonists option, the process is repeated until all range occurrences are placed within the domain. With the Paratrooper Model, the closest cell at the edge of the domain is chosen for cell placement that go beyond the domain boundaries. These options are not available unless the Stepping Stone Model is first chosen.

[ Top of Page ]

5. Output

Area Tab

The Area tab shows you the original Area map, which outlines the shape of the domain and shows the mapped values of the environmental variable in each grid cell.

Observed(richness) Tab

The Observed(richness) tab displays the observes species richness, which was calculated from the original species richness maps that were provided as input.

Simulation(richness) Tab

The Simulation(richness) tab illustrates an example of species richness patterns produced by one iteration of the model.

Observed(species) Tab

The Observed(species) tab illustrates an example of the distribution of one of the species map files.

Simulation(species) Tab

The Simulation(species) tab illustrates an example of the distribution of one of the species map files that has been randomized according to the chosen algorithm. This tab should be compared to the Observed(species) tab, which shows the range map of same species from the original data set.

Expected matrix Tab

This tab shows the expected number of species per grid cell for the simulation model. The expected number is the mean species richness per grid cell, averaged over the number of replications that were specified by the user.

Deviation statistic Tab

In this tab, the observed index is calculated as:

SUM(Observed Species Richness - Expected Species Richness)^2

with the index being summed over all possible grid cells. If the deviation statistic equals 0.0, then the expected species richness per grid cell was exactly the same as observed. As observed richness deviates more from model expectations, this deviation statistic gets larger.

The Histogram Low/High panels show you the cut-points for bins in a histogram distribution. The number of simulations indicates the number of iterations that fell in each histogram bin.

The lower panel gives the observed deviation index, and the average deviation index of the simulations, plus its variance. Next are the lower and upper tail probabilities for those tests (one-tailed).

Regression Tab

This tab gives basic summary statistics for the regression of observed species richness versus expected species richness, with the sample size being the number of grid cells in the domain.

Summary Tab

The summary tab gives the simulation conditions, including the name of the input file, the randomization constraints, and random number seed. Next, it presents the information that was contained in the Deviation statistic tab, lists the species in the data set and shows the observed and expected matrices. All of these data can be edited, deleted, or annotated. The output can then be saved (Save to File) or discarded (Close). There is also a small time clock in the lower right-hand corner so you can tell how long your simulation took.

[ Top of Page ]

6. Caveats

The speed of the program will depend on the spatial complexity of the domain, the distribution of mapped values, and the frequency of occurrence distribution of species. If there are some species that occur in all or nearly all cells of the domain, you may be able to speed up the program considerably by eliminating those species (which do not contribute much to variation amomg grid cells in species richness).

If the domain contains spatially disjunct areas of usable habitat, it is possible to lock up the program, because a species with a large range might get trapped in a small area and never be able to expand. But for domains that are contiguous, BioGeoSim does a good job of filling distributions.

Perhaps the biggest caution is that using a simple "all-or-none" classic null hypothesis test is often not useful, because very large data macroecology data sets will always end up rejecting the hypothesis. See the discussion  of these in Colwell et al. (2004). We think the model is much more useful for deriving the expected species richness per grid cell and comparing it to the observed data with regression and other methods (see Rahbek et al. 2007).

[ Top of Page ]

7. BioGeoSim Tutorial

The BioGeoSim tutorial uses a small subset of an extensive data set on the distribution of South American birds that has been carefully compiled and checked by Carsten Rahbek and Gary R. Graves.

Open BioGeoSim and click the "Analyze" button to take you to the menu of options.

Click the "load Area map" and open the file "SA Elevational Ranges.txt."

Click on the "Area Map" Tab to take a look at this file. As you scroll down towards the center of the file, you will see a set of grid values that are in the shape of the South American continent. Each grid cells contains an entry which is the elevational range (maximum minus minimum) in meters within each grid cell. Values of 0 indicate ocean cells or land cells that are not in the South American doman.

Next click on the "create Species richness map", and navigate to the folder labeled "Phaethornis maps". You will see a listing on the right of 23 species file names, each with a .txt suffix. Click the "OK" button to load these maps.

After a few minutes, click on the "Species file list" tab. This tab gives the listing of files that were loaded. On the right, it shows the number of grid cells occupied by each species. These range from a minimum of 1 grid cell (Phaethornis nigrirostris) to a maximum of 678 grid cells (Phaethornis ruber). If any of the files did not match the domain map, BioGeoSim would have stopped its processing at this point and issued an error message.

Click on the "Species richness" tab and you can see that BioGeoSim has superimposed the individual range maps and calculated species per grid cell. The integer values range from 0 (some grid cells in the domain do not contain any species in the genus Phaethornisto a maximum of 8 species (2 grid cells located near the central-western region of the occupied cells).

Now that the data have been successfully loaded, click on the "Preferences" tab to set up the model.

You will notice two subtabs that are found on the "Preferences" tab. Click on the "General" subtab, and change the number of iterations for the null model from 10 to 100. Now return ot the "Simulation options" subtab.

For this first analysis, we will keep the rest of the default values: the domain is the full map of South America, the first cell for each species range is chosen equiprobably, and the spread of the ranges into adjacent cells is also equiprobable. the dispersal algorithm is contiguous, meaning that the range only spreads contiguously from occupied cells. The spread can occur into any one of the four adjacent grid cells. These default values generate a classic "spreading dye" model (Jetz and Rahbek 2001) in which there is random placement and spread of species' contiguous geographic ranges within a bounded domain.

Click on the "Run" button, which will start the simulation and bring up a temporary window with a progress bar. You will see the program process 30 replications to get the expected values of species richness for each grid cell, then an additional 100 iterations for the null model goodness-of-fit test. The entire simulation should run in less than one minute.

When the simulation ends, it will take you first to the "Summary" tab, which gives the full settings for your model and the simulation outputs. This is a text file that can be annotated or edited. You can save the text file at any time by clicking on the "Save" menu option and choosing "Summary."

Now we will scroll through the rest of the tabs in sequence from left to right.

The first two tabs you have already seen during the initial set-up. The "Area" tab shows the domain map and the "Observed(richness)" tab shows the species richness in each grid cell in the domain (as calculated from the original species maps).

The Simulation(richness) tab shows the species richness pattern generated by one replicate of the model. Notice that this map of values fills up the map of South America, but the cells with the greatest species richness (usually five to seven species) occur in a rough cluster near the center of the domain. Lower values of richness usually occur near the edges of the continent, although there is some stochastic variation in the simulated pattern of species richness.

The tab "Observed(species)" shows you the original range map for one of the species in the data set. You should compare this to the next tab, "Simulation(species)," which shows the map of the same species as it was generated by one of the simulations. Both maps have the same total number of species occurrences, but the shape and placement of the second map was random, and was created with strict range cohesion.

The tab "Expected Matrix" gives the expected number of species per grid cell, which is the average calculated for the 30 replications. This graph is hard to see as a text matrix. However, if you carefully inspect the values, you will see that expected species richness drops off as you get near the margins of the continent, which is the classic prediction for most spreading dye models. You will probably want to save this matrix for further statistical and graphical analysis with other packages. Use the Save button at the top of the page and select the Expected Matrix to save as a .txt file to disk.

The "Deviation statistic" tab shows the results of the null model analysis. Once the 30 replications were completed and the expected matrix was calculated, an additional 100 iterations of the model were run. A goodness-of-fit statistic was calculated for each of these models compared to the expected matrix. The histogram of the 100 goodness-of-fit statistics was then compared to the goodness-of-fit statistic calculated with the original data. In this case, the observed deviation statistic was 5716.396. You can see that this value was larger than all 100 of the simulated values, which had an average deviation of 2315.016 (the exact number here will vary slightly among different simulation runs). In the Summary tab, you will see that the standardized effect size for this test is 4.26, meaning the observed statistic was more than four standard deviations larger than expected by chance.

Finally, the "Regression" tab shows simple statistics for the regression of observed versus expected species richness. If the model fits the data perfectly, the slope and r2 value will both equal 1.0. But in this case, you can see that the slope is very low, and the r2 is close to 0.0. The simple spreading dye model does not do a good job of predicting species richness in South America, at least across the entire domain.

Model Variations

Now let's briefly explore some model variations that are possible with BioGeoSim. First, try running the same model, but change the domain to "occupied cells," so we will only use those grid cells that had at least one species of Phaethornis present.

Since Phaethornisare not found in every South American grid cell, this change has the potential to improve the fit of the model. Although the r2 from the regression is still small (r2 = 0.09), the null model test is no longer statistically significant (standardized effect size = 0.92, p = 0.15; p values and standardized effect sizes will vary among runs).

Now let's move beyond the spreading dye model and consider a model in which the probability of origination and spread are proportional to measured elevational range in the area map. With the domain still set to occupied cells, change the seeding and colonization algorithms from "equiprobable" to "proportional to area map value". Retain contiguous dispersal with four cells.

This model requires a lot more computational effort from BioGeoSim and will take five minutes or to run. This model also fits the data poorly. If you examine the Expected matrix, you will see that the center of species diversity is shifted towards grid cells in the Andes Mountains (which have the highest elevational range), but this is not the observed center of diversity for Phaethornis.

As a final variation, you can see what the effect is of relaxing range cohesion, but modifying the previous model from contiguous dispersal to "range scatter," which scatters the range of each species within the domain, although probabilities of occurence will still be scaled to elevational range values.

This model runs faster, but fits the data even more poorly (see the deviation statistic and standardized effect size values), in part because the expected richness gradients are a bit more pronounced in the Andes due to the relaxation of range cohesion.

You may also wish to explore stepping stone models, which will require you to vary the Poisson parameter. At small Poisson values, the results will resemble range cohesion models, and at large Poisson values, the results will resemble range scatter models. However, some of these models can be very slow to run.

[ Top of Page ]

8. Literature Cited

Colwell, R.K., C. Rahbek, and N.J. Gotelli. 2004. The mid-domain effect and species richness patterns: what have we learned so far? American Naturalist 163: E1-E23.

Rahbek, C., N.J. Gotelli, R.K. Colwell, G.L. Entsminger, T.F.L.V.B. Rangel, and G.R. Graves. 2007. Predicting continental patterns of bird species richness with spatially explicit models. Proceedings of the Royal Society B 274: 165-174.

[ Top of Page ]

All Pages Copyright © 2007
by Acquired Intelligence, Inc. and Kesey-Bear
All rights reserved.