Update representations docs
mwalmsley committed Oct 4, 2021
1 parent da69671 commit 0ad2905
Showing 5 changed files with 89 additions and 83 deletions.
4 changes: 0 additions & 4 deletions docs/autodoc/data_utils/catalog_to_tfrecord.rst
@@ -6,10 +6,6 @@ catalog_to_tfrecord
This module contains utilities to write galaxy catalogs (as pandas dataframes) into TFRecord files.
These TFRecord files can be read from disk very quickly, which is useful for training ML models.

.. autofunction:: zoobot.data_utils.catalog_to_tfrecord.write_catalog_to_train_test_tfrecords

|
.. autofunction:: zoobot.data_utils.catalog_to_tfrecord.write_image_df_to_tfrecord

|
48 changes: 34 additions & 14 deletions docs/guides/representations.rst
@@ -3,32 +3,52 @@
Representations
===============

Representations are vectors that describe your input data, usually much lower-dimensional than the original input data.
You might like to extract the representations learned by the model. Here's how.
Representations are vectors that summarise your input data (here, galaxies).
In the context of Zoobot, our models learn to convert the image data (pixels) into representations before using those representations to make predictions.
You might like to extract the representations learned by a model - perhaps to use them directly for some new task, like a similarity search.
Here's how.

If you would like the representations of the trained GZ DECaLS model on the DECaLS DR5 galaxies, you can find them here (TODO Zenodo link).
These were used for the morphology tools paper. If you need your own model and representations, read on.
.. note::

## Training a New Model
If you would like the representations of the trained GZ DECaLS model on DECaLS DR5 galaxies or on GZ2 galaxies, you can find them here (TODO Zenodo link).
These were used for the morphology tools paper. If you need your own model and representations, read on.

Do this exactly like you normally would. It will work better on a broad multi-question task, like answering the GZ decision tree.
Training a New Model
--------------------

Representations (in this context) are simply the activations of the model before the final dense layer(s).
Training a model (optimizing the weights) teaches it to create useful representations of input images.
Train exactly like you normally would. The representations may be more useful if you train on a broad multi-question task, like answering the GZ decision tree.
See :ref:`reproducing_decals` for a guide to training a new model.

## Extracting the representation
We have published pretrained weights for models trained on GZ DECaLS - see :ref:`datanotes`.
You could start with these and calculate representations on some new galaxies.
See ``make_predictions_loop.py`` for how to load the weights, and below for how to calculate representations.

Extracting the representation is really just making a prediction, but without the final layers of the model.
Run ``make_predictions_loop.py``, configuring the model (by commenting) like so:
You might also want to start from a pretrained model and use finetuning to get the best representation for your problem.
See ``finetune_advanced.py`` for an example. This adds some complexity, so we suggest trying with our pretrained weights first.

- The base model should have ``include_top=False`` (we don't want the final dense layer)
- Add a new top with just the global pooling
- Group the base model and new top with ``tf.keras.Sequential``
- Set the ``label_cols`` to be as long as the dimension of your representation (e.g. 1280) rather than the usual answers (e.g. 34)
Extracting the Representation
-----------------------------

Extracting the representation is really just making a prediction, but without the final dense layers of the model.
``make_predictions_loop.py`` includes a working example for you to copy.

``make_predictions_loop.py`` can be used for three different kinds of predictions, depending on what you comment and uncomment.
To save the representations, uncomment the block marked "For saving the activations (representations)" and comment the others.
This configures the model like so:

- Defines a base model with no head (``include_top=False``) as we don't want the final dense layer for making volunteer predictions.
- Adds a new top with just ``GlobalAveragePooling2D``. This is the last layer we want to include: it averages over the two spatial axes of the preceding 7x7x1280 activations, giving a 1280-dim vector per galaxy.
- Groups the base model and new top with ``tf.keras.Sequential``.
- Sets the ``label_cols`` to be as long as the dimension of the representation (e.g. 1280 for EfficientNet) rather than the usual answers (e.g. 34)
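Putting those pieces together, a minimal sketch of the headless model might look like the code below. The ``zoobot.estimators.define_model`` import path, the checkpoint path, and the exact ``load_model`` arguments are assumptions here - copy the working call from the commented block in ``make_predictions_loop.py``.

.. code-block:: python

    import tensorflow as tf
    from zoobot.estimators import define_model  # import path assumed; check your zoobot version

    checkpoint_dir = 'results/my_run/checkpoints'  # hypothetical path - point at your trained checkpoint

    base_model = define_model.load_model(
        checkpoint_dir,
        include_top=False  # drop the final dense layers used for volunteer predictions
        # the working call in make_predictions_loop.py passes further arguments (image sizes, etc.);
        # copy those from the script rather than from this sketch
    )

    # new top: just global average pooling, turning the 7x7x1280 activations into a 1280-dim vector
    model = tf.keras.Sequential([
        base_model,
        tf.keras.layers.GlobalAveragePooling2D()
    ])

    # one 'label' column per representation dimension (e.g. 1280), not per decision tree answer (e.g. 34);
    # the exact names are up to you
    label_cols = ['feat_{}'.format(n) for n in range(1280)]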

As always, remember to check ``run_name`` and any file paths.

``make_predictions_loop.py`` will then save the representations for each galaxy to files like ``{run_name}_{index}.csv``.
These files are a bit awkward as they include lots of numbers like ``[[0.4, ...]]``.
Remove the brackets with ``predictions/reformat_predictions.py``.
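
For intuition only, the cleanup amounts to stripping those brackets and casting each ``feat_*`` cell back to a float. A rough sketch of the idea, assuming each cell holds a single stringified value - prefer the real script, which also handles file globbing, renaming and concatenation:

.. code-block:: python

    import pandas as pd

    # hypothetical path - one of the raw prediction csvs written by make_predictions_loop.py
    raw_df = pd.read_csv('data/results/dr5_rings_full_features_0_5000_raw.csv')

    feature_cols = [col for col in raw_df.columns if col.startswith('feat')]
    clean_df = raw_df.copy()
    for col in feature_cols:
        # turn strings like '[[0.4]]' into plain floats
        clean_df[col] = clean_df[col].astype(str).str.strip('[]').astype(float)

    clean_df.to_csv('data/results/dr5_rings_full_features_0_5000_clean.csv', index=False)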

Finally, compress the 1280-dim representation using PCA with ``representations/compress_representations.py``.
Finally, compress the 1280-dim representation into a lower dimensionality using PCA with ``representations/compress_representations.py``.
The compressed representation is mathematically very similar (PCA should preserve most of the interesting variation) but much easier to work with.
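
Under the hood this is just scikit-learn's ``IncrementalPCA``. A minimal sketch of the idea, assuming your cleaned representations are in a parquet file with ``feat_*`` columns; the script itself also saves an explained-variance plot and one output file per choice of component count.

.. code-block:: python

    import pandas as pd
    from sklearn.decomposition import IncrementalPCA

    df = pd.read_parquet('cnn_features_decals.parquet')  # hypothetical path to your cleaned representations
    feature_cols = [col for col in df.columns if col.startswith('feat')]
    features = df[feature_cols].values  # shape (galaxies, 1280)

    pca = IncrementalPCA(n_components=30, batch_size=20000)
    compressed = pca.fit_transform(features)  # shape (galaxies, 30)

    # keep the galaxy identifiers alongside the compressed features (column names here are illustrative)
    embed_df = pd.DataFrame(compressed, columns=['pca_feat_{}'.format(n) for n in range(compressed.shape[1])])
    embed_df['iauname'] = df['iauname'].values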

2 changes: 1 addition & 1 deletion make_predictions_loop.py
@@ -84,7 +84,7 @@
# schema = schemas.Schema(question_answer_pairs, dependencies)
# label_cols = schema.label_cols

"""For saving the activations - the model with no head"""
"""For saving the activations (representations) - use the model with no head, only GlobalAveragePooling2D"""
# base_model = define_model.load_model(
# checkpoint_dir,
# include_top=False,
6 changes: 4 additions & 2 deletions zoobot/predictions/reformat_predictions.py
@@ -88,11 +88,13 @@ def main(raw_search_str, clean_search_str, reformatted_parquet_loc, overwrite=Fa

overwrite = True

run_name = 'dr5_rings'
run_name = 'dr5_rings' # the text identifying each output prediction csv, e.g. dr5_rings_full_features_0_5000.csv, etc.

raw_search_str = 'data/results/{}_*_raw.csv'.format(run_name)
clean_search_str = raw_loc_to_clean_loc(raw_search_str)
clean_search_str = raw_loc_to_clean_loc(raw_search_str) # simply gets new name e.g. 'data/results/{}_*_clean.csv'
assert raw_search_str != clean_search_str

# each cleaned csv will be concatenated and saved here
reformatted_parquet_loc = os.path.join(os.path.dirname(raw_search_str), '{}_cleaned_concat.parquet'.format(run_name))

main(raw_search_str, clean_search_str, reformatted_parquet_loc, overwrite=overwrite)
112 changes: 50 additions & 62 deletions zoobot/representations/compress_representations.py
@@ -11,14 +11,29 @@
from sklearn.decomposition import IncrementalPCA


def create_pca_embedding(features, n_components, variance_plot_loc=None):
def create_pca_embedding(features: np.array, n_components: int, variance_plot_loc=None):
"""
Compress galaxy representations into a lower dimensionality using Incremental PCA.
These compressed representations are easier and faster to work with.
Args:
features (np.array): galaxy representations, of shape (galaxies, feature_dimensions)
n_components (int): number of PCA components to use. Sets output dimension.
variance_plot_loc (str, optional): If not None, save plot of variance vs. PCA components here. Defaults to None.
Raises:
ValueError: features includes np.nan values (PCA would break)
Returns:
np.array: PCA-compressed representations, of shape (galaxies, pca components)
"""
assert len(features) > 0
pca = IncrementalPCA(n_components=n_components, batch_size=20000)
reduced_embed = pca.fit_transform(features)
if np.isnan(reduced_embed).any():
raise ValueError(f'embed is {np.isnan(reduced_embed).mean()} nan')

if variance_plot_loc: # only need for last one
if variance_plot_loc:
plt.plot(range(n_components), pca.explained_variance_ratio_)
plt.xlabel('Nth Component')
plt.ylabel('Explained Variance')
@@ -29,81 +44,54 @@ def create_pca_embedding(features, n_components, variance_plot_loc=None):



def main(features_cleaned_and_concat_loc, catalog_loc, name, output_dir):

# made by reformat_cnn_features.py
df = pd.read_parquet(features_cleaned_and_concat_loc)
df['png_loc'] = df['filename'].str.replace('/share/nas/walml/galaxy_zoo/decals/dr5/png/', '')

"""join to catalog"""
catalog = pd.read_parquet(catalog_loc)
df = pd.merge(df, catalog, on='png_loc', how='inner').reset_index(drop=True) # applies previous filters implicitly
df = df.sample(len(df), random_state=42).reset_index()
assert len(df) > 0
logging.info(len(df))

# # rename dr8 catalog cols
# df = df.rename(columns={
# 'weighted_radius', 'estimated_radius', # TODO I have since improved this column, need to update
# 'dr8_id': 'galaxy_id'
# })

# rename dr5 catalog cols
df = df.rename(columns={
'petro_th50': 'estimated_radius', # TODO I have since improved this column, need to update
'iauname': 'galaxy_id'
})

df.to_parquet(os.path.join(output_dir, '{}_full_features_and_safe_catalog.parquet'.format(name)), index=False)
def main(df: pd.DataFrame, name: str, output_dir: str, components_to_calculate=[5, 10, 30], id_col='iauname'):
"""
Wrapper around :meth:`create_pca_embedding`.
Creates and saves several embeddings using (by default) 5, 10, and 30 PCA components.
Args:
df (pd.DataFrame): with columns of id_col (below) and feat_* (e.g. feat_0_pred, feat_1_pred, ...) recording representations for each galaxy (row)
name (str): Text to identify saved outputs. No effect on results.
output_dir (str): Directory in which to save results. No effect on results.
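components_to_calculate (list, optional): numbers of PCA components to try; one set of outputs is saved per entry. Defaults to [5, 10, 30].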
id_col (str, optional): Name of column containing unique strings identifying each galaxy. Defaults to 'iauname', matching DECaLS catalog. 'id_str' may be useful to match GZ2 catalog.
"""
feature_cols = [col for col in df.columns.values if col.startswith('feat')]

features = df[feature_cols].values

components_to_calculate = [5, 10, 30]
for n_components in tqdm.tqdm(components_to_calculate):

# only need to bother with the variance plot for the highest num. components
if n_components == np.max(components_to_calculate):
variance_plot_loc = 'explained_variance.pdf'
variance_plot_loc = os.path.join(output_dir, name + '_explained_variance.pdf')
else:
variance_plot_loc = None

embed_df = create_pca_embedding(features, n_components, variance_plot_loc=variance_plot_loc)
embed_df['galaxy_id'] = df['galaxy_id']
embed_df[id_col] = df[id_col] # pca embedding doesn't shuffle, so can copy the id col across to new df
embed_df.to_parquet(os.path.join(output_dir, '{}_pca{}_and_ids.parquet'.format(name, n_components)), index=False)


if __name__ == '__main__':

sns.set_context('notebook')

features_cleaned_and_concat_loc = '/share/nas/walml/repos/zoobot/data/results/dr5_color_cnn_features_concat.parquet'
catalog_loc = '/share/nas/walml/dr5_nsa_v1_0_0_to_upload.parquet'

name = 'dr5_color'
output_dir = '/share/nas/walml/repos/zoobot/data/results'

main(features_cleaned_and_concat_loc, catalog_loc, name=name, output_dir=output_dir)




# catalog_loc = '/raid/scratch/walml/repos/download_DECaLS_images/working_dr8_master.parquet'

# columns=['png_loc', 'weighted_radius', 'ra', 'dec', 'dr8_id'] # dr8 catalog cols

# TODO I think this has not been correctly filtered for bad images. Run the checks again, perhaps w/ decals downloader? Or check data release approach
# dr5_df = pd.read_parquet('dr5_b0_full_features_and_safe_catalog.parquet')
# """Rename a few columns"""
# print(dr5_df.head())
# dr5_df['estimated_radius'] = dr5_df['petro_th50']
# dr5_df['galaxy_id'] = dr5_df['iauname']


# df = pd.concat([dr5_df, dr8_df], axis=0).reset_index(drop=True) # concat rowwise, some cols will have nans - but not png_loc or feature cols
# important to reset index else index is not unique, would be like 0123...0123...


# df[not_feature_cols].to_parquet('dr5_dr8_catalog_with_radius.parquet')

# TODO rereun including ra/dec and check for duplicates/very close overlaps
output_dir = '/Users/walml/repos/zoobot/data/results'
assert os.path.isdir(output_dir)

name = 'decals_dr5_oct_21'
# made by reformat_predictions.py
features_loc = '/Volumes/beta/cnn_features/decals/cnn_features_decals.parquet' # TODO point this to your download from Zenodo
df = pd.read_parquet(features_loc)
# TODO replace second arg with your image download folder
df['png_loc'] = df['png_loc'].str.replace('/media/walml/beta1/decals/png_native/dr5', '/Volumes/beta/decals/png_native/dr5')
id_col = 'iauname'

# name = 'gz2'
# features_loc = '/Volumes/beta/cnn_features/gz2/cnn_features_gz2.parquet' # TODO point this to your download from Zenodo
# df = pd.read_parquet(features_loc)
# # TODO replace second arg with your image download folder
# df['png_loc'] = df['png_loc'].str.replace('/media/walml/beta1/galaxy_zoo/gz2/png', '/Volumes/beta/galaxy_zoo/gz2/png')
# id_col = 'id_str'

main(df, name, output_dir, id_col=id_col)
