Motivation
Geospatial data has become increasingly important across disciplines as remote sensing data, including satellite imagery, proliferates. Large, continuous streams of data are publicly available from satellites such as Sentinel and Landsat. Traditionally, this data must be labeled before it can be used to train machine learning models, but labeling large datasets is expensive in both time and labor. Moreover, many large labeled datasets lack geographic diversity and therefore do not generalize to the rest of the world. Using a model pre-trained on ImageNet can mitigate the need for large amounts of labeled data, but this approach has limited application to remote sensing tasks because of the unique characteristics of satellite images. Recent research, including Caron et al. (2020), Chen et al. (2020), and Calhoun et al. (2022), has shown that self-supervised learning (SSL) techniques can develop robust feature representations of geospatial imagery with minimal task-specific labeled data. We present GeoNet, a dataset that exceeds the scale and diversity of any prior dataset, and evaluate its usefulness by comparing the performance of SSL models pre-trained in three different ways: SSL pretraining on ImageNet, SSL pretraining on GeoNet, and SSL pretraining on ImageNet followed by further training on GeoNet.
Key Contributions
Develop GeoNet
GeoNet is the largest and most diverse dataset yet curated for SSL in remote sensing. It contains over 10 million images spanning geographies, cities, rural regions, biomes, and areas impacted by global change, sampled across 8 half-seasons, making it the first and largest dataset to capture geospatial, temporal, and semantic diversity for remote sensing.
Test different training paradigms of SwAV on GeoNet
We test different training paradigms of SwAV on GeoNet and find that GeoNet outperforms state-of-the-art models in three out of five downstream recognition tasks and is competitive with other SSL models.
Publish GeoEye
GeoEye, a large pre-trained encoder for RGB imagery, significantly reduces the quantity of labeled data required for downstream recognition tasks, making recognition methods far more accessible and compute-efficient for researchers.
Our work follows five stages, illustrated in the figure below.

Background
Historical uses of remote sensing imagery for machine learning have primarily involved supervised learning: using labeled input data to train a model to perform classification tasks. This requires large, labeled remote sensing imagery datasets. Manually labeling these datasets is extremely time-consuming, requires expert knowledge, and must be repeated for each new dataset. (ImageNet, one of the largest and most widely used natural image recognition datasets with 14 million training images, required years of human labeling.) Furthermore, existing datasets of this nature are limited in geographic diversity, representing mainly Europe and North America. Training machine learning models on data that covers only the Global North limits their transferability to other parts of the world.
Self-supervised learning (SSL) methods, which train a model to learn feature representations without labeled input data, address these limitations and remove the need for large, task-specific labeled datasets. Through pre-training on large, diverse, unlabeled datasets, self-supervised models learn to extract robust, highly transferable features that can then be applied to a wide range of other tasks and domains. This makes SSL especially attractive for remote sensing applications.
One popular SSL technique is contrastive learning, in which a model is trained to discriminate between positive examples (pairs of data that are similar in some way) and negative examples (pairs of data that are dissimilar). Many self-supervised models trained with contrastive learning have been shown to close the gap with, and even outperform, supervised models. One popular SSL model in this family is SwAV (Swapping Assignments between Views), which eliminates the need for pairwise comparisons by using a clustering-based method that compares multiple views of the same image; a sketch of its objective is shown below.
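To make the swapped-assignment idea concrete, here is a minimal PyTorch sketch of SwAV's objective, following the pseudocode in Caron et al. (2020). The function names, default hyperparameters, and number of Sinkhorn iterations are illustrative choices, not the exact configuration used in our experiments.

```python
import torch
import torch.nn.functional as F

def sinkhorn(scores, eps=0.05, iters=3):
    """Sinkhorn-Knopp iteration: turn prototype scores into soft cluster
    assignments ("codes") spread roughly evenly across clusters."""
    Q = torch.exp(scores / eps).T  # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(iters):
        Q /= Q.sum(dim=1, keepdim=True)  # normalize over the batch per prototype
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)  # normalize over prototypes per sample
        Q /= B
    return (Q * B).T  # (B, K); each row is a soft assignment summing to 1

def swav_loss(z1, z2, prototypes, temperature=0.1):
    """Swapped prediction loss for two augmented views of the same images.

    z1, z2: L2-normalized embeddings of the two views, shape (B, D).
    prototypes: L2-normalized learnable cluster centers, shape (K, D).
    """
    scores1, scores2 = z1 @ prototypes.T, z2 @ prototypes.T  # (B, K)
    with torch.no_grad():  # codes serve as targets; no gradient through them
        q1, q2 = sinkhorn(scores1), sinkhorn(scores2)
    p1 = F.log_softmax(scores1 / temperature, dim=1)
    p2 = F.log_softmax(scores2 / temperature, dim=1)
    # Predict each view's code from the *other* view's embedding
    return -0.5 * ((q2 * p1).sum(1).mean() + (q1 * p2).sum(1).mean())
```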
However, commonly used pretraining datasets for self-supervised image classification and object detection, such as ImageNet, have limited application to remote sensing tasks because of satellite images' unique characteristics: satellite images capture observations from above, often come in multi-spectral bands rather than RGB, and typically lack a single salient object of interest. We are aware of no equivalent to ImageNet specifically intended for pre-training on satellite imagery data.
GeoNet Dataset
To create an immense dataset that captures geospatial, temporal, and semantic diversity, we sample for diversity across geographies (all continents and countries), cities (all major cities are included), rural regions (on all continents and in all countries), biomes (forests, deserts, tundra, etc.), and areas impacted by global change (floods, fires, storms, etc.), across 8 half-seasons.
Data was collected from 7.2M unique 2.24 km x 2.24 km areas on every continent except Antarctica. Of the 10 million images, urban and rural areas make up 60%, natural disaster areas 19%, land use and land cover 20%, and specific built environment features 1%. GeoNet is also temporally diverse, as Sentinel-2 images were extracted from different time periods.
We used four sampling methods to build this dataset, giving it diversity across urban versus rural regions, built versus natural environments, different land covers, and natural-disaster-prone regions:
Method 1: Urban vs Rural
Method 2: Built vs Natural Environment
Method 3: Land Type
Method 4: Natural Disasters
Sampling Strategy: A Grid-Based Approach to Avoid Duplication and Overlap in Extracted Images
A 2.24 x 2.24 km grid is defined to match SwAV's input dimensions (224 x 224 pixels) at the resolution of Sentinel-2 images (10 m/pixel). Mapping coordinates onto a discrete grid allows nearby locations with overlapping scenes to be removed. Coordinates used to extract Sentinel-2 images are collected with several considerations (a sketch of the deduplication logic follows the list):
- 10 million samples are generated using all four sampling methods, after which the grid cells containing these coordinates are determined. Oversampled grid cells are pruned by limiting the number of images per grid cell to 4.
- If a grid cell is sampled more than once, its images are extracted from different search periods, adding seasonal and temporal diversity while reducing redundant views.
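The following is a minimal sketch of this grid-based deduplication, assuming a simple equirectangular approximation for converting degrees to kilometers; the constants and function names are illustrative, and the actual pipeline may use a proper map projection.

```python
import math
from collections import defaultdict

GRID_KM = 2.24            # cell size: 224 px at 10 m/px Sentinel-2 resolution
KM_PER_DEG_LAT = 111.32   # approximate km per degree of latitude

def grid_cell(lat, lon):
    """Map a (lat, lon) coordinate to a discrete grid cell index."""
    row = math.floor(lat * KM_PER_DEG_LAT / GRID_KM)
    # Degrees of longitude shrink with latitude, so scale by cos(lat)
    col = math.floor(lon * KM_PER_DEG_LAT * math.cos(math.radians(lat)) / GRID_KM)
    return row, col

def deduplicate(samples, max_per_cell=4):
    """Keep at most max_per_cell sampled coordinates per grid cell."""
    kept, counts = [], defaultdict(int)
    for lat, lon in samples:
        cell = grid_cell(lat, lon)
        if counts[cell] < max_per_cell:
            counts[cell] += 1
            kept.append((lat, lon))
    return kept
```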
Experiments
Experiment Design
We aim to compare the impact of different pretraining paradigms on the transfer learning performance of self-supervised and fully supervised models. We compare pretraining on ImageNet, the default in most fields, with pretraining on a remote sensing dataset; we hypothesize that the latter leads to better representations, given that remote sensing data is distinct from other types of imagery. We also compare the performance of our model GeoEye to that of MoCo-v2-fMoW and SeCo, two state-of-the-art SSL models trained on satellite imagery. The five pretraining methods are shown in Table 1.
Table 1: Pretraining Methods
For each pretraining method, we apply and evaluate its transfer learning performance on the five downstream benchmark inference tasks in Table 2. To this end, we developed a pipeline for applying pretrained models. As shown in the figure, the experiment consists of two steps. First, we use the ResNet50 backbone from each pretrained encoder to extract representations. Second, we attach a decoder to the ResNet50 backbone, either a U-Net for segmentation tasks or a fully connected linear layer for classification tasks, and fine-tune the decoder using the labels provided by the benchmark dataset.
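For the classification case, the pipeline looks roughly like the following PyTorch sketch; the state-dict loading and names are hypothetical, and the segmentation case would attach a U-Net decoder in place of the linear head.

```python
import torch.nn as nn
from torchvision.models import resnet50

def build_classifier(pretrained_state_dict, num_classes):
    """Frozen pretrained ResNet50 backbone plus a trainable linear head."""
    backbone = resnet50(weights=None)
    backbone.load_state_dict(pretrained_state_dict, strict=False)
    backbone.fc = nn.Identity()            # expose the 2048-d features
    for p in backbone.parameters():        # freeze the encoder
        p.requires_grad = False
    head = nn.Linear(2048, num_classes)    # fine-tuned on benchmark labels
    return nn.Sequential(backbone, head)
```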
A learning rate of 1e-3 with a batch size of 16 and the Adam optimizer is used for all experiments. We adopt the hyperparameters from Calhoun et al. (2022) to keep results comparable. We restrict the fine-tuning training set to 64, 128, 256, 512, and 1024 images, which lets us evaluate the transfer learning performance of each pretraining method under real-world conditions where labeled data is limited. A soft IoU loss based on Milletari et al. (2016) is used for all semantic segmentation tasks, while cross-entropy loss is used for multi-class and multi-label classification.
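For reference, a common soft IoU formulation (in the spirit of Milletari et al., 2016) looks like the sketch below; this is an illustrative implementation, not necessarily the exact loss code used in our experiments.

```python
import torch

def soft_iou_loss(logits, targets, eps=1e-6):
    """Soft (differentiable) IoU loss for binary segmentation.

    logits: raw model outputs, shape (B, 1, H, W)
    targets: binary ground-truth masks in {0, 1}, same shape
    """
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum(dim=(1, 2, 3))
    union = (probs + targets - probs * targets).sum(dim=(1, 2, 3))
    return (1.0 - (intersection + eps) / (union + eps)).mean()
```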
Benchmarking
We assemble a list of accessible, well-studied, and highly representative benchmark datasets for two categories of downstream tasks: classification and semantic segmentation. From this list, we prioritized Sentinel-2 images with wide geographical coverage and restricted image resolution to between 0.1 and 30 m/pixel. We preprocess all benchmark images to 224 x 224 pixels, matching the image size of the GeoNet dataset used for pretraining our model. We evaluate GeoNet on these datasets by comparing its performance to that of fully supervised models and other state-of-the-art SSL models.
Table 2: Benchmark Datasets Information and Examples
For each benchmark dataset, we follow the steps described in Experiment Design, comparing against the methods in Table 1. In the fine-tuning phase on the labeled benchmark data, we use the U-Net decoder for segmentation tasks and the linear classifier for classification tasks across all pretraining methods. The loss and accuracy metrics we adopt are reported in Table 3.
Table 3: Evaluation Metrics for Each Benchmark Dataset
Results
We identified which methods perform better across training sizes. SwAV-3M outperforms the other pretraining methods on the SEN12MS dataset, which has weak labels. For BigEarthNet, both SwAV-3M and SwAV-ImageNet perform well on the multilabel classification task, with SwAV-3M performing best at larger training sizes. For DeepGlobe, SwAV-3M, SwAV-ImageNet, and supervised ImageNet all performed well, with supervised ImageNet best at smaller training sizes and SwAV-ImageNet best at larger training sizes; SeCo performed noticeably worse, and MoCo-v2-fMoW produced by far the lowest IoU results. For SustainBench, SwAV-3M and SwAV-ImageNet outperform the other pretraining methods when the training size is small, while at larger training sizes all models except SeCo perform similarly. For EuroSAT, SwAV-ImageNet performs best, and SwAV-3M does not outperform the supervised baseline.
The results of our experiments are summarized below. For each benchmark task, we plot the accuracy metric against the training set size used in fine-tuning, with a separate line for each pretraining method.
SEN12MS
SwAV-3M consistently outperformed SeCo, MoCo-v2 trained on fMoW, supervised ImageNet, and SwAV trained on ImageNet on the SEN12MS dataset. SEN12MS is a weakly labeled semantic segmentation dataset, so all pretraining methods are limited, hence the low IoU values. The IoU score for each pretraining method increases significantly as training size grows from 64 to 1024 images. The results also show that self-supervised pretraining, with the exception of SeCo, can outperform supervised pretraining on this weakly labeled segmentation task.
BigEarthNet
SwAV-3M, SwAV trained on ImageNet, and MoCo-v2 trained on fMoW outperformed the other two models on this multilabel classification task, with top-1 precision scores between 0.65 and 0.7. SwAV-3M performed slightly better than SwAV-ImageNet and MoCo-v2-fMoW at smaller training sizes (64, 256, and 512). The gains in performance gradually diminish at larger training sizes across the five approaches.
SustainBench
SwAV-3M and SwAV-ImageNet outperform the other pretraining methods when the training size is small (64, 128, 256, and 512). However, as the training size increases to 1024, the marginal benefit of SwAV pretraining over the other methods decreases. Pretraining SwAV on 3M Sentinel-2 images yields a small improvement over ImageNet pretraining. Overall, the best IoU scores for SwAV pretraining on both our dataset and ImageNet are comparable to those in Calhoun et al. (2022), at around 0.44.
EuroSAT
MoCo-v2-fMoW, SwAV-ImageNet, and supervised ImageNet all performed better than SwAV-3M. SeCo performed noticeably worse than the other methods.
DeepGlobe
SwAV-3M, SwAV-ImageNet, and supervised ImageNet all performed best, with IoU values comparable to those reported on DeepGlobe's website. MoCo-v2-fMoW performed slightly below those three, and SeCo had the lowest results of all the methods. While SwAV-3M had the best results in some settings, we cannot conclusively say that it had the best overall performance on this task.
Key Takeaways
SSL models (with the exception of SeCo) consistently outperform supervised pretraining across different types of downstream tasks in the label-limited scenario.
SwAV pretraining outperforms the other SSL methods. SwAV-3M is competitive, and sometimes best, on datasets such as the SustainBench field delineation task and SEN12MS.
Where SwAV-3M does not outperform SwAV-ImageNet, this may be because the domain transfer from ImageNet to satellite images is not yet complete; we are investigating further.
Future Steps
Future steps in evaluating the performance of our encoder could include the use of more benchmark datasets, expansion to higher resolutions, and the addition of weak labels. Testing our encoder against more benchmark datasets would more extensively demonstrate the diversity of its potential applications. Currently, our dataset consists of images at a resolution of 10 m/pixel; we could extend our testing to higher-resolution benchmark tasks, and expand the dataset itself to higher resolutions, to see how the GeoNet self-supervised encoder performs on those tasks.
Our second hypothesis is that a natural language processing (NLP) encoder enables mappings between textual annotations and satellite images, allowing customized queries of climate data. The textual data has already been gathered, and the self-supervised image encoder has been trained. As a next step, we want to build an NLP model to extract text features from the collected text so that we can support a variety of queries. We would then combine the text features from the language encoder with the image features from our self-supervised encoder, training on the geographic data and the weak labels shared by both; after this step, the previously unlabeled images would carry text features. The result could be presented through a simple web application that lets users run text searches and view the related images.
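As an illustration of what this joint text-image querying could look like, here is a hedged sketch of retrieval by cosine similarity between embeddings; the encoders and the shared embedding space are hypothetical placeholders for the models described above, which have not yet been jointly trained.

```python
import torch
import torch.nn.functional as F

def retrieve_images(query_embedding, image_embeddings, top_k=5):
    """Rank images by cosine similarity to a text query embedding.

    query_embedding: (D,) tensor from the (hypothetical) text encoder.
    image_embeddings: (N, D) tensor from the self-supervised image encoder,
        projected into the same shared space during joint training.
    """
    q = F.normalize(query_embedding, dim=0)
    imgs = F.normalize(image_embeddings, dim=1)
    similarities = imgs @ q                  # (N,) cosine similarities
    return torch.topk(similarities, k=top_k).indices
```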
Our Team
Margaret Brooks
Computer Science
Alex Desbans
Electrical & Computer Engineering, Computer Science
Neel Gajjar
Computer Science, Statistics
Julia Kourelakos
Computer Science
Saad Lahrichi
Data Science
Vaishvi Patel
Electrical & Computer Engineering, Computer Science
Ruixin Zhang
Neuroscience & Computer Science, Minor in Cinematic Arts
Ruohan Zhang
Computer Science & Mathematics
Graduate Project Managers
Francesca Chiappetta
Environmental Management
Shufan Xia
Interdisciplinary Data Science
Faculty Team Leaders
Kyle Bradbury
Pratt School of Engineering, Electrical & Computer Engineering; Nicholas Institute for Energy, Environment, and Sustainability
Jordan Malof
Pratt School of Engineering, Electrical & Computer Engineering; Energy Initiative