Merge remote-tracking branch 'origin/master' into faster_python_import

MTG · Dec 14, 2023 · 8ed5045
2 parents b9615f7 + 95c996e
Showing 32 changed files with 913 additions and 169 deletions.
diff --git a/FAQ.md b/FAQ.md
@@ -153,6 +153,8 @@ A lightweight version of Essentia for iOS can be compiled using the ```--cross-c
 
 You can also compile it for iOS simulator (so that you can test on your desktop) using ```--cross-compile-ios-sim``` flag.
 
+Please note that TensorFlow-based Essentia algorithms are currently not supported on iOS because we do not offer a TensorFlow Lite wrapper.
+
 
 Compiling Essentia to ASM.js or WebAssembly using Emscripten
 ------------------------------------------------------------

diff --git a/doc/sphinxdoc/_templates/applications.html b/doc/sphinxdoc/_templates/applications.html
@@ -296,7 +296,18 @@ <h1>Applications</h1>
         <a href="https://github.com/leozimmerman/ofxAudioAnalyzer">ofxAudioAnalyzer</a> is an openFrameworks wrapper for Essentia. It provides audio analysis algorithms modified to process signals in real-time.
       </dd>
     </div>
-
+    <div class="row essnt-apps-page__container">
+      <dt class="col-xs-2 col-sm-3 col-md-2 essnt-apps-page__logo">
+        <a href="https://github.com/p3zo/gifsync" title="Go to GIF Sync">
+          <span class="essnt-apps-page__logo-text">
+            GIF Sync
+          </span>
+        </a>
+      </dt>
+      <dd class="col-xs-10 col-sm-9 col-md-10 essnt-apps-page__description">
+        <a href="https://github.com/p3zo/gifsync">GIF Sync</a> reassembles the frames of a GIF to sync its animation to the beat of an audio file.
+      </dd>
+    </div>
   </dl>
 
 {% endblock %}
diff --git a/doc/sphinxdoc/demos.rst b/doc/sphinxdoc/demos.rst
@@ -9,14 +9,21 @@ Examples of music audio analysis with Essentia algorithms using Essentia.js
 https://mtg.github.io/essentia.js/examples/
 
 
+Tempo estimation
+----------------
+
+Tempo BPM estimation with Essentia: https://replicate.com/mtg/essentia-bpm
+
+
 Essentia TensorFlow models
 --------------------------
 
 Examples of inference with the pre-trained TensorFlow models for music auto-tagging and classification tasks:
 
 - Music classification by genre, mood, danceability, instrumentation: https://replicate.com/mtg/music-classifiers
-- Music style classification with the Discogs taxonomy (400 styles). Overall track-level predictions: https://replicate.com/mtg/effnet-discogs
-- Music style classification with the Discogs taxonomy (400 styles). Segment-level real-time predictions with Essentia.js: https://essentia.upf.edu/essentiajs-discogs
+- Music style classification with the Discogs taxonomy (400 styles, MAEST model). Overall track-level predictions: https://replicate.com/mtg/maest
+- Music style classification with the Discogs taxonomy (400 styles, Effnet-Discogs model). Overall track-level predictions: https://replicate.com/mtg/effnet-discogs
+- Music style classification with the Discogs taxonomy (400 styles, Effnet-Discogs model). Segment-level real-time predictions with Essentia.js: https://essentia.upf.edu/essentiajs-discogs
 - Real-time music autotagging (50 tags) in the browser with Essentia.js: https://mtg.github.io/essentia.js/examples/demos/autotagging-rt/
 - Mood classification in the browser with Essentia.js: https://mtg.github.io/essentia.js/examples/demos/mood-classifiers/
 - Music emotion arousal/valence regression: https://replicate.com/mtg/music-arousal-valence

diff --git a/doc/sphinxdoc/models.rst b/doc/sphinxdoc/models.rst
@@ -25,7 +25,7 @@ If you use any of the models in your research, please cite the following paper::
       booktitle={International Conference on Acoustics, Speech and Signal Processing ({ICASSP})},
       year={2020}
     }
-    
+
 .. highlight:: default
 
 
@@ -137,6 +137,105 @@ Models:
 *Note: We provide models operating with a fixed batch size of 64 samples since it was not possible to port the version with dynamic batch size from ONNX to TensorFlow. Additionally, an ONNX version of the model with* `dynamic batch <https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bsdynamic-1.onnx>`_ *size is provided.*
 
 
+MAEST
+^^^^^
+
+Music Audio Efficient Spectrogram Transformer (`MAEST <https://github.com/palonso/MAEST/>`_) trained to predict music style labels using an in-house dataset annotated with Discogs metadata.
+We offer versions of MAEST trained with sequence lengths ranging from 5 to 30 seconds (``5s``, ``10s``, ``20s``, and ``30s``), and trained starting from different initial weights: from random initialization (``fs``), from `DeiT <https://doi.org/10.48550/arXiv.2012.12877>`_ pre-trained weights (``dw``), and from `PaSST <https://doi.org/10.48550/arXiv.2106.07139>`_ pre-trained weights (``pw``). Additionally, we offer a version of MAEST trained following a teacher-student setup (``ts``).
+According to our study, ``discogs-maest-30s-pw`` achieved the most competitive performance in most downstream tasks (refer to the `paper <http://hdl.handle.net/10230/58023>`_ for details).
+
+
+Models:
+
+    .. collapse:: ⬇️ <a class="reference external">discogs-maest-30s-pw</a>
+
+        |
+
+            [`weights <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-30s-pw-1.pb>`_, `metadata <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-30s-pw-1.json>`_]
+
+            Model trained with a multi-label classification objective targeting 400 Discogs styles.
+
+            Python code for embedding extraction:
+
+            .. literalinclude:: ../../src/examples/python/models/scripts/feature-extractors/maest/discogs-maest-30s-pw-1_embeddings.py
+
+    .. collapse:: ⬇️ <a class="reference external">discogs-maest-30s-pw-ts</a>
+
+        |
+
+            [`weights <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-30s-pw-ts-1.pb>`_, `metadata <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-30s-pw-ts-1.json>`_]
+
+            Model trained with a multi-label classification objective targeting 400 Discogs styles.
+
+            Python code for embedding extraction:
+
+            .. literalinclude:: ../../src/examples/python/models/scripts/feature-extractors/maest/discogs-maest-30s-pw-ts-1_embeddings.py
+
+    .. collapse:: ⬇️ <a class="reference external">discogs-maest-20s-pw</a>
+
+        |
+
+            [`weights <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-20s-pw-1.pb>`_, `metadata <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-20s-pw-1.json>`_]
+
+            Model trained with a multi-label classification objective targeting 400 Discogs styles.
+
+            Python code for embedding extraction:
+
+            .. literalinclude:: ../../src/examples/python/models/scripts/feature-extractors/maest/discogs-maest-20s-pw-1_embeddings.py
+
+    .. collapse:: ⬇️ <a class="reference external">discogs-maest-10s-pw</a>
+
+        |
+
+            [`weights <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-10s-pw-1.pb>`_, `metadata <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-10s-pw-1.json>`_]
+
+            Model trained with a multi-label classification objective targeting 400 Discogs styles.
+
+            Python code for embedding extraction:
+
+            .. literalinclude:: ../../src/examples/python/models/scripts/feature-extractors/maest/discogs-maest-10s-pw-1_embeddings.py
+
+    .. collapse:: ⬇️ <a class="reference external">discogs-maest-10s-fs</a>
+
+        |
+
+            [`weights <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-10s-fs-1.pb>`_, `metadata <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-10s-fs-1.json>`_]
+
+            Model trained with a multi-label classification objective targeting 400 Discogs styles.
+
+            Python code for embedding extraction:
+
+            .. literalinclude:: ../../src/examples/python/models/scripts/feature-extractors/maest/discogs-maest-10s-fs-1_embeddings.py
+
+    .. collapse:: ⬇️ <a class="reference external">discogs-maest-10s-dw</a>
+
+        |
+
+            [`weights <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-10s-dw-1.pb>`_, `metadata <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-10s-dw-1.json>`_]
+
+            Model trained with a multi-label classification objective targeting 400 Discogs styles.
+
+            Python code for embedding extraction:
+
+            .. literalinclude:: ../../src/examples/python/models/scripts/feature-extractors/maest/discogs-maest-10s-dw-1_embeddings.py
+
+    .. collapse:: ⬇️ <a class="reference external">discogs-maest-5s-pw</a>
+
+        |
+
+            [`weights <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-5s-pw-1.pb>`_, `metadata <https://essentia.upf.edu/models/feature-extractors/maest/discogs-maest-5s-pw-1.json>`_]
+
+            Model trained with a multi-label classification objective targeting 400 Discogs styles.
+
+            Python code for embedding extraction:
+
+            .. literalinclude:: ../../src/examples/python/models/scripts/feature-extractors/maest/discogs-maest-5s-pw-1_embeddings.py
+
+
+*Note: It is possible to retrieve the output of each attention layer by setting* ``output=StatefulPartitionedCall:n`` *, where* ``n`` *is the index of the layer (starting from 1).*
+*The output from the attention layers should be interpreted as* ``[batch_index, 1, token_number, embeddings_size]``
+*, where the first and second tokens (i.e.,* ``[0, 0, :2, :]`` *) correspond to the* ``CLS`` *and* ``DIST`` *tokens respectively, and the remaining ones to the input signal.*
+
 OpenL3
 ^^^^^^
 
@@ -240,7 +339,7 @@ The name of these models is a combination of the classification/regression task
 *Note: TensorflowPredict2D has to be configured with the correct output layer name for each classifier. Check the attached JSON file to find the name of the output layer on each case.*
 
 
-Music genre and style 
+Music genre and style
 ^^^^^^^^^^^^^^^^^^^^^
 
 
@@ -2071,6 +2170,3 @@ Models:
             Python code for predictions:
 
             .. literalinclude :: ../../src/examples/python/models/scripts/tempo/tempocnn/deeptemp-k16-3_predictions.py
-
-
-
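
The MAEST note in `models.rst` describes the attention-layer output as `[batch_index, 1, token_number, embeddings_size]`, with the `CLS` and `DIST` tokens first. The NumPy sketch below shows one way to split such an array. It does not load a model: `layer_output` is a synthetic stand-in, the sizes are made up for illustration, and `split_tokens` is a hypothetical helper rather than part of Essentia.

```python
import numpy as np

# Hypothetical sizes for illustration only; real values depend on the model.
batch_size, n_tokens, emb_size = 2, 10, 8

# Synthetic stand-in for one attention-layer output with layout
# [batch_index, 1, token_number, embeddings_size].
layer_output = np.arange(
    batch_size * n_tokens * emb_size, dtype=np.float32
).reshape(batch_size, 1, n_tokens, emb_size)


def split_tokens(activations):
    """Split attention-layer activations into CLS, DIST, and signal tokens."""
    if activations.ndim != 4 or activations.shape[1] != 1:
        raise ValueError("expected shape [batch, 1, tokens, embedding]")
    tokens = activations[:, 0]        # drop the singleton axis
    cls_token = tokens[:, 0, :]       # first token
    dist_token = tokens[:, 1, :]      # second token
    signal_tokens = tokens[:, 2:, :]  # tokens derived from the input signal
    return cls_token, dist_token, signal_tokens


cls_token, dist_token, signal_tokens = split_tokens(layer_output)
print(cls_token.shape, dist_token.shape, signal_tokens.shape)
```

Keeping the batch axis in all three results lets the same helper serve single-patch and multi-patch inputs.
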
diff --git a/doc/sphinxdoc/research_papers.md b/doc/sphinxdoc/research_papers.md
@@ -79,6 +79,8 @@ Indexing music by mood: design and integration of an automatic content-based ann
 
 ## Emotion detection
 
+- Azuaje, G., Liew, K., Epure, E., Yada, S., Wakamiya, S., & Aramaki, E. (2023). Visualyre: multimodal album art generation for independent musicians. Personal and Ubiquitous Computing, 1-12.
+
 - S. Chowdhury, and G. Widmer. On perceived emotion in expressive piano performance: Further experimental evidence for the relevance of mid-level perceptual features. In International Society for Music Information Retrieval (ISMIR 2021), 2021.
 
 - Byun, S. W., Lee, S. P. A Study on a Speech Emotion Recognition System with Effective Acoustic Features Using Deep Learning Algorithms. Applied Sciences, 11(4), 1890, 2021.

diff --git a/pyproject-tensorflow.toml b/pyproject-tensorflow.toml
@@ -8,7 +8,7 @@ manylinux-x86_64-image = "mtgupf/essentia-builds:manylinux2014_x86_64"
 
 # Only support x86_64 for essentia-tensorflow
 build = "cp**-manylinux_x86_64"
-skip = ["pp*", "*-musllinux*"]
+skip = ["pp*", "*-musllinux*", "*i686"]
 
 environment = { PROJECT_NAME="essentia-tensorflow", ESSENTIA_PROJECT_NAME="${PROJECT_NAME}", ESSENTIA_WHEEL_SKIP_3RDPARTY=1, ESSENTIA_WHEEL_ONLY_PYTHON=1 }
 

diff --git a/pyproject.toml b/pyproject.toml
@@ -6,7 +6,7 @@ build-verbosity = 3
 manylinux-x86_64-image = "mtgupf/essentia-builds:manylinux2014_x86_64"
 manylinux-i686-image = "mtgupf/essentia-builds:manylinux2014_i686"
 
-skip = ["pp*", "*-musllinux*"]
+skip = ["pp*", "*-musllinux*", "*i686"]
 
 environment = { PROJECT_NAME="essentia", ESSENTIA_PROJECT_NAME="${PROJECT_NAME}", ESSENTIA_WHEEL_SKIP_3RDPARTY=1, ESSENTIA_WHEEL_ONLY_PYTHON=1 }
 

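
The `skip` change in both `pyproject` files adds the glob `*i686` to the list of excluded wheel builds. The sketch below approximates how such a list filters build identifiers, assuming plain Unix-style glob matching via Python's `fnmatch`. The identifiers are illustrative examples in the `cpXY-platform_arch` naming style, and `is_skipped` is a hypothetical helper, not part of any build tool.

```python
from fnmatch import fnmatchcase

# Skip patterns as they appear after this change.
skip_patterns = ["pp*", "*-musllinux*", "*i686"]

# Illustrative build identifiers (not an exhaustive list).
identifiers = [
    "cp311-manylinux_x86_64",
    "cp311-manylinux_i686",
    "cp311-musllinux_x86_64",
    "pp39-manylinux_x86_64",
]


def is_skipped(identifier, patterns):
    """Return True if the identifier matches any skip pattern."""
    return any(fnmatchcase(identifier, pattern) for pattern in patterns)


kept = [name for name in identifiers if not is_skipped(name, skip_patterns)]
print(kept)
```

Under these assumptions, only the 64-bit CPython manylinux identifier survives the filter.
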
diff --git a/src/algorithms/filters/iir.cpp b/src/algorithms/filters/iir.cpp
@@ -26,7 +26,7 @@ using namespace standard;
 
 const char* IIR::name = "IIR";
 const char* IIR::category = "Filters";
-const char* IIR::description = DOC("This algorithm implements a standard IIR filter. It filters the data in the input vector with the filter described by parameter vectors 'numerator' and 'denominator' to create the output filtered vector. In the litterature, the numerator is often referred to as the 'B' coefficients and the denominator as the 'A' coefficients.\n"
+const char* IIR::description = DOC("This algorithm implements a standard IIR filter. It filters the data in the input vector with the filter described by parameter vectors 'numerator' and 'denominator' to create the output filtered vector. In the literature, the numerator is often referred to as the 'B' coefficients and the denominator as the 'A' coefficients.\n"
 "\n"
 "The filter is a Direct Form II Transposed implementation of the standard difference equation:\n"
 "  a(0)*y(n) = b(0)*x(n) + b(1)*x(n-1) + ... + b(nb-1)*x(n-nb+1) - a(1)*y(n-1) - ... - a(nb-1)*y(n-na+1)\n"

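
The `IIR` description above refers to a Direct Form II Transposed realization of the difference equation. A minimal pure-Python sketch of that structure follows. It is an illustration written for this note, not the Essentia C++ implementation, and the name `iir_filter` is hypothetical.

```python
def iir_filter(numerator, denominator, signal):
    """Filter `signal` with B (numerator) and A (denominator) coefficients
    using a Direct Form II Transposed structure."""
    if not denominator or denominator[0] == 0:
        raise ValueError("denominator[0] must be non-zero")

    # Normalize so that a[0] == 1.
    a0 = denominator[0]
    b = [coeff / a0 for coeff in numerator]
    a = [coeff / a0 for coeff in denominator]

    # Pad both coefficient lists to the same length.
    order = max(len(a), len(b))
    b += [0.0] * (order - len(b))
    a += [0.0] * (order - len(a))

    # Delay line; the last element is never written and stays zero.
    state = [0.0] * order
    output = []
    for x in signal:
        y = b[0] * x + state[0]
        for i in range(1, order):
            state[i - 1] = b[i] * x - a[i] * y + state[i]
        output.append(y)
    return output


# One-pole low-pass: y(n) = 0.5*x(n) + 0.5*y(n-1), fed with a unit impulse.
impulse_response = iir_filter([0.5], [1.0, -0.5], [1.0, 0.0, 0.0, 0.0])
print(impulse_response)
```

The transposed form keeps a single delay line shared by the feed-forward and feedback paths.
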
diff --git a/src/algorithms/machinelearning/tensorflowpredict.cpp b/src/algorithms/machinelearning/tensorflowpredict.cpp
@@ -366,6 +366,7 @@ const Tensor<Real> TensorflowPredict::TFToTensor(
 TF_Output TensorflowPredict::graphOperationByName(const string nodeName) {
   int index = 0;
   const char* name = nodeName.c_str();
+  string newNodeName;
 
   // TensorFlow operations (or nodes from the graph perspective) return tensors named <nodeName:n>, where n goes
   // from 0 to the number of outputs. The first output tensor of a node can be extracted implicitly (nodeName)
@@ -374,22 +375,16 @@ TF_Output TensorflowPredict::graphOperationByName(const string nodeName) {
   string::size_type n = nodeName.find(':');
   if (n != string::npos) {
     try {
-      string::size_type next_char;
-      index = stoi(nodeName.substr(n + 1), &next_char);
-
-      if (n + next_char + 1 != nodeName.size()) {
-        throw EssentiaException("TensorflowPredict: `" + nodeName + "` is not a valid node name, the index cannot "
-                                "be followed by other characters. Make sure that all your inputs and outputs follow "
-                                "the pattern `nodeName:n`, where `n` in an integer that goes from 0 to the number "
-                                "of outputs of the node - 1.");
-      }
+      newNodeName = nodeName.substr(0, n);
+      name = newNodeName.c_str();
+      index = stoi(nodeName.substr(n + 1, nodeName.size()));
 
     } catch (const invalid_argument& ) {
       throw EssentiaException("TensorflowPredict: `" + nodeName + "` is not a valid node name. Make sure that all "
                               "your inputs and outputs follow the pattern `nodeName:n`, where `n` in an integer that "
                               "goes from 0 to the number of outputs of the node - 1.");
-    } 
-    name = nodeName.substr(0, n).c_str();
+    }
+
   }
 
   TF_Operation* oper = TF_GraphOperationByName(_graph, name);
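
The hunk above splits a tensor name of the form `nodeName:n` into the operation name and an output index. It now stores the name in `newNodeName`, so the pointer passed to `TF_GraphOperationByName` stays valid, where the removed line took `c_str()` of a temporary string. The Python sketch below mirrors only the parsing rule. `parse_node_name` is a hypothetical helper, and `int()` is stricter than `std::stoi` because it rejects trailing characters after the digits.

```python
def parse_node_name(node_name):
    """Split 'nodeName:n' into (name, index); the index defaults to 0."""
    # Split on the first ':' only, like string::find(':') in the C++ code.
    name, separator, index_text = node_name.partition(":")
    if not separator:
        # No explicit index: the first output of the node is implied.
        return node_name, 0
    try:
        index = int(index_text)
    except ValueError:
        raise ValueError(
            f"`{node_name}` is not a valid node name; expected `nodeName:n` "
            "with an integer `n`"
        ) from None
    return name, index


print(parse_node_name("StatefulPartitionedCall:7"))
print(parse_node_name("StatefulPartitionedCall"))
```
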