Mining Data from California's Geotracker Database

Geotracker is a public database that is used to store environmental data from regulated sites in California. I’m going to download some data of a contaminated site and clean it up to try to derive some insights about the site. I picked Alpine county because it is the least populated one and I want to work with a smaller file. Some of the more industrial counties can have thousands of sites.

URL <- "https://geotracker.waterboards.ca.gov/data_download/edf_by_county/AlpineEDF.zip"
download.file(URL, destfile='alpine.zip', method='curl')

After downloading the zipped file I need to unzip it and read in the .txt file. Let’s see what’s inside.

unzip('alpine.zip')
alpine <- read.delim("AlpineEDF.txt")
head(alpine)
A data.frame: 6 × 23
COUNTYGLOBAL_IDFIELD_PT_NAMELOGDATELOGTIMELOGCODESAMPIDMATRIXLABWOLABCODEANADATEBASISPARLABELPARVALPARVQLABDLREPDLUNITSDILFACLNOTE
<chr><chr><chr><chr><int><chr><chr><chr><chr><chr><chr><chr><chr><dbl><chr><dbl><dbl><chr><dbl><chr>
1AlpineT0600300005K1 2012-10-081110CLSRK1-15 SOCVJ0496CLSR2012-10-12WPHCD 0ND0.033 1.0MG/KG1
2AlpineT0600300005KMPUD#22008-12-301330CLSRKMPUD #2 W CSA0056CLSR2009-01-07NMOIL 79= 9.10050.0UG/L 1DU,DU
3AlpineT0600300005KMPUD#22002-04-161203KFR KMPUD #2 WXNA ALPS2002-04-18NMTBE 0ND0.250 0.5UG/L 1
4AlpineT0600300005KMPUD#22002-04-161203KFR KMPUD #2 WXNA ALPS2002-04-18NBZME 0ND0.250 0.5UG/L 1
5AlpineT0600300005KMPUD#22002-10-311805KFR KMPUD WELL #2WXNA ALPS2002-11-04NDCE11 0ND0.500 1.0UG/L 1
6AlpineT0600300005KMPUD#22002-10-311805KFR KMPUD WELL #2WXNA ALPS2002-11-04NBZME 0ND0.250 0.5UG/L 1

There’s documentation about all of the fields, but we can already make sense of some of them. GLOBAL_ID represents a site, and FIELD_PT_NAME is a well. PARLABEL is the code for the name of the contaminant and PARVAL is the concentration reported. There’s also helpful QA/QC information, and the latitude and longitude of the wells are in a different file on the geotracker website. We’ll focus on one site with the most observations reported.

library(dplyr)

alpine %>%
group_by(GLOBAL_ID) %>%
count() %>%
arrange(desc(n))
Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
A grouped_df: 7 × 2
GLOBAL_IDn
<chr><int>
T060030001111764
T0600300005 5776
T0600397314 5131
T0600300013 3780
T0600300007 2145
T0600300008 1532
1

Looks like T0600300011 is our site. Let’s try to see what the main chemicals of concern and their concentrations are.

site <- alpine %>%
  filter(GLOBAL_ID == "T0600300011")
site %>%
  group_by(PARLABEL) %>%
  tally(sort = TRUE) %>%
  head()
A tibble: 6 × 2
PARLABELn
<chr><int>
BZ 260
BZME 260
EBZ 260
XYLENES1314203
XYLO 203
MTBE 194

The most results came from benzene (BZ), toluene (BZME), ethylbenzene (EBZ), xylene - isomers m & p (XYLENES1314), o-xylene (XYLO), and methyl-tert-butyl ether (MTBE). This is a hydrocarbon site. Now let’s try to see which wells are near the source area.

top_wells <- site %>%
filter(PARLABEL %in% c("BZ", "BZME", "EBZ", "XYLENES1314", "XYLO", "MTBE")) %>%
select(c("FIELD_PT_NAME", "LOGDATE", "MATRIX", "PARLABEL", "PARVAL", "PARVQ", "LABDL", "REPDL", "UNITS",)) %>%
group_by(FIELD_PT_NAME) %>%
tally(sort = TRUE) %>%
filter(n>100)

# the wells with the most data are:
top_wells
A tibble: 6 × 2
FIELD_PT_NAMEn
<chr><int>
MW-3167
MW-2161
MWA4157
MW-1155
MWA1143
MWA2143

The most sampled wells are MW-3, MW-2, MWA4, MW-1, MWA1, and MWA2. Naming conventions for wells can mean nothing or can mean a lot, and get wackier the longer a site has been around.

site <- site %>%
  filter(FIELD_PT_NAME %in% top_wells$FIELD_PT_NAME) %>%
  filter(PARLABEL %in% c("BZ", "BZME", "EBZ", "XYLENES1314", "XYLO", "MTBE")) %>%
  select(c("FIELD_PT_NAME", "LOGDATE", "MATRIX", "PARLABEL", "PARVAL", "PARVQ", "LABDL", "REPDL", "UNITS",)) 

head(site)
A data.frame: 6 × 9
FIELD_PT_NAMELOGDATEMATRIXPARLABELPARVALPARVQLABDLREPDLUNITS
<chr><chr><chr><chr><dbl><chr><dbl><dbl><chr>
1MW-12002-10-08WBZ 0ND5.05.0UG/L
2MW-12003-09-08WBZ 0ND5.05.0UG/L
3MW-12002-11-25WXYLO0ND5.05.0UG/L
4MW-12003-03-05WMTBE0ND5.05.0UG/L
5MW-12003-03-05WEBZ 0ND5.05.0UG/L
6MW-12003-03-05WEBZ 0ND0.50.5UG/L

Let's take a closer look at the units. The first six observations are reported in μg/L, but are they all?

First we’ll want to clean up the classes - the FIELD_PT_NAME should be a factor and LOGDATE should be a date.

site$FIELD_PT_NAME <- as.factor(site$FIELD_PT_NAME)
site$LOGDATE <- as.Date(site$LOGDATE)
site$MATRIX <- as.factor(site$MATRIX)
site$PARLABEL <- as.factor(site$PARLABEL)
site$PARVQ <- as.factor(site$PARVQ)
site$UNITS <- as.factor(site$UNITS)
levels(site$PARLABEL) <- list(Benzene="BZ", Toluene="BZME", Ethylbenzene="EBZ", Xylenes_MP="XYLENES1314", Xylene_O="XYLO", MTBE="MTBE")

summary(site)
 FIELD_PT_NAME    LOGDATE           MATRIX          PARLABEL  
 MW-1:155      Min.   :2002-07-11   W:926   Benzene     :179  
 MW-2:161      1st Qu.:2003-05-27           Toluene     :179  
 MW-3:167      Median :2004-05-05           Ethylbenzene:179  
 MWA1:143      Mean   :2005-11-06           Xylenes_MP  :132  
 MWA2:143      3rd Qu.:2008-05-05           Xylene_O    :132  
 MWA4:157      Max.   :2010-04-14           MTBE        :125  
     PARVAL        PARVQ        LABDL           REPDL         UNITS    
 Min.   :   0.00   = :135   Min.   :0.080   Min.   : 0.500   UG/L:926  
 1st Qu.:   0.00   ND:791   1st Qu.:0.380   1st Qu.: 0.500             
 Median :   0.00            Median :0.500   Median : 0.500             
 Mean   :  20.95            Mean   :1.909   Mean   : 3.568             
 3rd Qu.:   0.00            3rd Qu.:5.000   3rd Qu.: 5.000             
 Max.   :1150.00            Max.   :9.900   Max.   :50.000             

Turns out that they are all reported in μg/L, which is good, because that means the concentrations are probably fairly low - remember one μg/L is one part per billion. This isn't always the case - be sure to keep an eye out for units and do conversions as necessary before working with the data. The PARVQ tells us if the chemical was detected in the sample or below the detection limit (ND). For 791 of the observations, the chemical was not detected in the sample, while in only 135 observations a true detected concentration is reported. In cases where the chemical is not detected, is NA, zero, the lab detection limit, report detection limit, or something else reported?

site %>%
filter(PARVQ == 'ND') %>%
head()
A data.frame: 6 × 9
FIELD_PT_NAMELOGDATEMATRIXPARLABELPARVALPARVQLABDLREPDLUNITS
<fct><date><fct><fct><dbl><fct><dbl><dbl><fct>
1MW-12002-10-08WBenzene 0ND5.05.0UG/L
2MW-12003-09-08WBenzene 0ND5.05.0UG/L
3MW-12002-11-25WXylene_O 0ND5.05.0UG/L
4MW-12003-03-05WMTBE 0ND5.05.0UG/L
5MW-12003-03-05WEthylbenzene0ND5.05.0UG/L
6MW-12003-03-05WEthylbenzene0ND0.50.5UG/L

It looks like the value reported for non-detected samples is zero. This can cause problems when analyzing the data statistically. Let’s visualize what these concentrations look like at MW-1.

MW1 <- site %>%
filter(FIELD_PT_NAME == "MW-1")

MW2 <- site %>%
filter(FIELD_PT_NAME == "MW-2")

MW3 <- site %>%
filter(FIELD_PT_NAME == "MW-3")

MWA1 <- site %>%
filter(FIELD_PT_NAME == "MWA1")

MWA2 <- site %>%
filter(FIELD_PT_NAME == "MWA2")

MWA4 <- site %>%
filter(FIELD_PT_NAME == "MWA4")

library(ggplot2)

ggplot(MW1, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) + 
geom_line() + 
geom_point(alpha = 0.6) +
labs(title ="MW-1", x ="Date", y = "Concentration in ug/L")

ggplot(MW2, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) + 
geom_line() + 
geom_point(alpha = 0.6) +
labs(title ="MW-2", x ="Date", y = "Concentration in ug/L")

ggplot(MW3, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) + 
geom_line() + 
geom_point(alpha = 0.6) +
labs(title ="MW-3", x ="Date", y = "Concentration in ug/L")

ggplot(MWA1, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) + 
geom_line() + 
geom_point(alpha = 0.6) +
labs(title ="MWA1", x ="Date", y = "Concentration in ug/L")

ggplot(MWA2, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) + 
geom_line() + 
geom_point(alpha = 0.6) +
labs(title ="MWA2", x ="Date", y = "Concentration in ug/L")

png

png

png

png

png

At these wells it looks like there were some spikes before 2004, but nothing much going on later. Let’s look at the last well from our selection - MWA4.

ggplot(MWA4, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) + 
geom_line() + 
geom_point(alpha = 0.6) +
labs(title ="MWA4", x ="Date", y = "Concentration in ug/L")

png

This one has a lot higher concentrations - take a look at the values on the y axis. Yikes! Let’s see what the EPA Maximum Contaminant Levels (MCLs) are for these chemicals.

ggplot(MWA4, aes(x = LOGDATE, y = PARVAL, color = PARLABEL)) + 
geom_line() + 
geom_point(alpha = 0.6) +
labs(title ="MWA4", x ="Date", y = "Concentration in ug/L") +
geom_hline(yintercept=5, linetype="dashed", color="#F8766D") +
geom_hline(yintercept=1000, linetype="dashed", color="#B79F00") +
geom_hline(yintercept=700, linetype="dashed", color="#00BA38") 

png

There is no MCL for MTBE (there are probably state guidelines) and the MCL for total xylenes is 10,000, which is off this chart. Benzene has the lowest MCL by far, and it is usually the chemical that drives cleanup. It’s a little hard for us to see it on this plot, so let’s zoom in some more.

MWA4_benzene <- MWA4 %>%
filter(PARLABEL == "Benzene")

ggplot(MWA4_benzene, aes(x = LOGDATE, y = PARVAL, color = "Benzene",)) + 
geom_line() + 
geom_point(alpha = 0.6) +
labs(title ="MWA4", x ="Date", y = "Benzene Concentration in ug/L") +
geom_hline(yintercept=5, linetype="dashed", color="red") 

png

Around 2004 this well has almost seven times the MCL of benzene, but it quickly went down to non-detectable or very low for the rest of the recorded period. If I was especially interested in this site I would try to request some of the accompanying reports and perform a Mann-Kendall analysis of trends in the wells, but mostly I wanted to show how you can download data from Geotracker and manipulate it in R.

This tutorial is done entirely in a jupyter notebook running an R kernal inside of a docker container. The source files are on github and if you have any questions about using jupyter or docker don’t hesitate to contact me!

Liz McConnell
Liz McConnell
Graduate Student, CSU Center for Contaminant Hydrology

My research interests include contaminant fate and transport, data analysis using statistics and machine learning, R programming, and geospatial analysis.

comments powered by Disqus

Related