Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add format versions count report #98

Merged
merged 1 commit into from
Dec 17, 2020
Merged

Conversation

tw4l
Copy link
Contributor

@tw4l tw4l commented Nov 30, 2020

Connected to #93

This PR implements the "Format versions count" report in the UI and API, with support for limiting results by start and end date. This commit also removes the Bootstrap alerts from the report selection template, as there is no longer the need for any "Coming soon"-type messages, and users should be able to tell that their reports are loading in a new tab without these alerts.

NB: With reference to this conversation in #65 (comment), I suspect that some changes will need to be made around implementing the start_date and end_date parameters before merging. This might mean changing this report to use files' ingestion Event date or it might mean modifying the "File format count" report to use AIP.create_date, as I've done here. My rationale for using AIP.create_date is that that date is likely the most visible date for Archivematica users, since it is displayed in both the Archival Storage tab and the Storage Service's Package tab. But honestly, I'm not really sure which is the better approach and would like some input, as I'd like to set a standard practice and use that moving forward wherever possible.

In many cases, file ingestion dates and AIP creation dates will be the same, or at least similar. But in some cases, as in the case of files that are stored in a transfer backlog for weeks or months prior to becoming part of a SIP (and, eventually, an AIP), they may end up being quite different.

So when we say "start date" and "end date" in relation to something like this report, or the "File format count" report, do we mean when the file was first ingested into an Archivematica instance, or when it became part of an AIP? Might that answer change for other reports? How might we better communicate what these dates mean to users? (I've tried in the API documentation with some success, I think, but am more at a loss for what to do in the GUI)

@tw4l tw4l force-pushed the dev/issue-93-format-version-count branch 8 times, most recently from fe51478 to ce12964 Compare December 2, 2020 22:35
@tw4l tw4l force-pushed the dev/issue-93-format-version-count branch from ce12964 to 21e78de Compare December 2, 2020 22:55
@tw4l tw4l changed the title WIP: Add format versions count report Add format versions count report Dec 2, 2020
@tw4l tw4l marked this pull request as ready for review December 2, 2020 22:58
@tw4l tw4l requested review from ross-spencer and peterVG December 2, 2020 22:58

:param date_string: Date string. Valid format is YYYY-MM-DD.

:returns: datetime.datetime object or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might consider doing the work of _prepare_datetimes_for_query(...) here and returning a consistent data type in this helper. For me there are two main clues/reasons:

  1. Mixing return types might cause complexity upstream. Testing for None vs. getting to work on a newly minted datetime.
  2. I appreciate the amount of thought gone into the date handling across the PR, but the story as I see it through the code, is that we're largely only dealing with date ranges which I believe these helpers are largely setup for handling going forward. I think we're always going to know with a min or a max. So I might suggest a function signature of, something like:
def parse_datetime_bounds(date_string, Upper=False):
   return datetime (or datetime + 1 day if upper==true) if valid,
   else return datetime.min if invalid, or datetime.max if invalid and upper==true

I think we might see a cleaner coding style appear by decoding early, rather than doing a little bit now, and a little bit later.

Comment on lines 21 to 26
if isinstance(date_string, str):
try:
return datetime.strptime(date_string, "%Y-%m-%d")
except ValueError:
pass
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you can get away with something like:

def parse_date(date_string):
    """Parse date parameter string into datetime object or None."""
    try:
        return datetime.strptime(date_string, "%Y-%m-%d")
    except (ValueError, TypeError):
        pass

If you wanted to simplify this a little if the function is kept as is.

@ross-spencer
Copy link
Contributor

ross-spencer commented Dec 8, 2020

@tw4l thanks for this. Overall I can see this is a high quality PR, although I have to step into it in more detail.

I feel like AIP created date is exactly the right date to use here as I think this is specifically AIP level reporting. You might be interested in this ticket I mentioned about the archival storage tab too: archivematica/Issues#731

I have left some initial comments about the date handling which are good additions and I look forward to seeing applied to other existing reports.

What I am a little bit stuck on before I start to review in more detail, is the similarity in the shape of the data between the two data functions:

  1. curl -X GET "http://127.0.0.1:5000/api/data/aip-overview/1" -H "accept: application/json"
  2. curl -X GET "http://127.0.0.1:5000/api/data/format-version-count/1" -H "accept: application/json"

It reminds me of a conversation before, but the similarity feels too hard to ignore, and there should be something done to (my preference - consolidate, rather than duplicate or remove). So I'd just like to ask more about the rationale for 2 over 1. And I think we need to understand what happens to both now, with respect for what each of them offer. My initial instinct is that 2. should probably not have been created without acknowledgement of 1 (i.e. potentially adding to technical debt). I could be wrong, the partial duplication might be fine. Maybe this really is a case of repository level reporting, vs. starting to implement functions to support preservation watch and treatment?

@tw4l
Copy link
Contributor Author

tw4l commented Dec 11, 2020

Thank you for the feedback @ross-spencer! As always I appreciate your sharp mind and generosity.

I feel like AIP created date is exactly the right date to use here as I think this is specifically AIP level reporting. You might be interested in this ticket I mentioned about the archival storage tab too: archivematica/Issues#731

Fantastic, and thanks for the helpful context with that issue! Given that, I will plan to add a commit in this PR that modifies the "Format count report" view to also use the AIP created date, to bring the reports into line with each other.

I have left some initial comments about the date handling which are good additions and I look forward to seeing applied to other existing reports.

Thanks, these comments make a lot of sense! I agree with your insight that we'd be better off decoding early and always returning a datetime object, and will make changes accordingly.

What I am a little bit stuck on before I start to review in more detail, is the similarity in the shape of the data between the two data functions ... It reminds me of a conversation before, but the similarity feels too hard to ignore, and there should be something done to (my preference - consolidate, rather than duplicate or remove). So I'd just like to ask more about the rationale for 2 over 1.

This is fair, and truthfully it's just the result of an oversight on my part. I started with the UI report and worked backwards from there (i.e. deciding what the report looks like, then writing the Data endpoint accordingly, and finally adding the API endpoint to keep good coverage there). I suspect that the similarity and potential to extend what was already there would have been much more apparent to me if I had worked in the reverse order. But you're also right that we've had similar conversations before, so apologies for making us cover this ground again.

Given what you've pointed out, I see a few potential ways forward:

  1. Keep both sets of Data endpoints and API endpoints.
  2. Remove the new Data and API endpoints, and use the existing Data.aip_overview endpoint (adding file size, date limiting, and testing comparable to Data.format_versions_count) for the "Format versions count report" in the UI.
  3. Hybrid approach: Keep the new Data.format_versions_count Data endpoint and continue to use it for the "Format versions count report" in the UI, but remove the new API endpoint in favour of the existing aip_overview API endpoint. We could potentially also supplement the aip_overview API endpoint by adding aggregate file size, date limiting, and tests so that nothing's lost to consumers by dropping the format_version_count API endpoint.

I agree with you that having both API endpoints as they are doesn't add much value currently (your thought about "repository level reporting, vs. starting to implement functions to support preservation watch and treatment" is intriguing, but I'm not sure we're there yet) so that rules out approach 1 for me.

My first thought was to go with approach 2, even though it would mean throwing out a lot of the work in this PR. However, as this would require first fetching data from the Data.aip_overview endpoint and then (in another Data endpoint function or the Reporter view) significantly transforming the shape of that data and sorting it for use in the UI template, there would be costs in terms of speed and complexity vs. leaning on the database/ORM to do the heavy lifting and return the data in the shape we need. Personally, I don't see the benefits of this approach outweighing those costs.

We could also consider the inverse of approach 2 (i.e. keeping the new endpoints and removing the old), but I see value in keeping the API endpoints a bit generic so that other systems can consume them and use them for their own purposes and I think the aip_overview API endpoint exemplifies that well.

So as I write this, I'm inclined to go with the hybrid approach 3. In essence, I see this as optimizing for speed in the reporting UI and thoroughness/generality in the API endpoint. What do you think? Does that sound reasonable to you?

@ross-spencer
Copy link
Contributor

Thanks @tw4l for laying these options out. It's really helpful. I'm thinking through them. I was thinking it might benefit me to share some previous context/thinking here in person and I wonder impact any particular approach might have one some of the decisions we make in development moving forward, so I requested a Google Hangout with yourself and Peter if that sounds good?

Going through some of this again I don't feel is that big a deal. It's really timely. There's more that I could have done here to signpost intentions - and what I think might be a somewhat orphaned function in the data module/API endpoint certainly doesn't help! It's going to come up more for other folks as we get others contributing.

@tw4l
Copy link
Contributor Author

tw4l commented Dec 14, 2020

Sounds good, thanks @ross-spencer! Looking forward to discussing further this afternoon.

@tw4l
Copy link
Contributor Author

tw4l commented Dec 14, 2020

To sum up how we left the offline conversation:

  • I will move the format_version_count Data and API endpoints to a new module and namespace respectively so that there is a clear differentiation between endpoints that are for generic data consumption vs. endpoints that are feeding data for reports implemented in the UI.
  • I will assess the other existing endpoints for inclusion in the new module/namespace and move the ones that should be moved - e.g. the largest file report endpoints.
  • @ross-spencer will complete code review for this PR, with the understanding that I will be moving the endpoints and making the suggested changes to the helper for parsing date strings into datetime objects, but have not done so yet.

Copy link
Contributor

@ross-spencer ross-spencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tw4l lots of this looking good. I'm submitting a first go code-review in the hope it might aid any of your efforts this week. I know you're working on the changes we discussed. There are a few small style things here, something to think about in the tests, and possibly some fortification to add to the report generation but I like the look of the report.


This fixture is used to create expected state which is then used to
test the Data.format_versions_count endpoint.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look good, I wonder if we can think about providing some helpers to make it easier to create fixtures? Simplify the constructors a little? Recognizing time restrictions this week I've explored simplifying this a little and come up with the following, but what do you think?

diff --git a/AIPscan/Data/tests/conftest.py b/AIPscan/Data/tests/conftest.py
index 6fb1e88..816cccf 100644
--- a/AIPscan/Data/tests/conftest.py
+++ b/AIPscan/Data/tests/conftest.py
@@ -3,6 +3,8 @@
 """This module defines shared Data blueprint pytest fixtures."""
 
 import datetime
+import uuid
+
 import pytest
 
 from AIPscan import db, create_app
@@ -27,6 +29,121 @@ AIP_1_CREATION_DATE = "2020-01-01"
 AIP_2_CREATION_DATE = "2020-06-01"
 
 
+def _create_test_storage_service(**kwargs):
+    """Wrapper to slightly more easily return a sample storage service
+    for conftest fixtures.
+
+    param: ...
+    param: ...
+    param: ...
+    """
+    name = kwargs.get("name", "test storage service")
+    url = kwargs.get("name", "http://example.com")
+    user_name = kwargs.get("user_name", "test")
+    api_key = kwargs.get("api_key", "test")
+    download_limit = kwargs.get("download_limit", 20)
+    download_offset = kwargs.get("download_offset", 10)
+    default = kwargs.get("default", True)
+    storage_service = StorageService(
+        name=name,
+        url=url,
+        user_name=user_name,
+        api_key=api_key,
+        download_limit=download_limit,
+        download_offset=download_offset,
+        default=default,
+    )
+    db.session.add(storage_service)
+    db.session.commit()
+    return storage_service
+
+
+def _create_fetch_job(**kwargs):
+    """Wrapper to slightly more easily return a sample fetch job for
+    conftest fixtures.
+
+    param: ...
+    param: ...
+    param: ...
+    """
+    total_packages = kwargs.get("total_packages", 1)
+    total_aips = kwargs.get("total_aips", 1)
+    total_deleted_aips = kwargs.get("total_deleted_aips", 0)
+    download_start = kwargs.get("download_start", datetime.datetime.now())
+    download_end = kwargs.get("download_end", datetime.datetime.now())
+    download_directory = kwargs.get("download_directory", "/some/directory/")
+    storage_service_id = kwargs.get("storage_service_id", 1)
+    fetch_job = FetchJob(
+        total_packages=total_packages,
+        total_aips=total_aips,
+        total_deleted_aips=total_deleted_aips,
+        download_start=download_start,
+        download_end=download_end,
+        download_directory=download_directory,
+        storage_service_id=storage_service_id,
+    )
+    db.session.add(fetch_job)
+    db.session.commit()
+    return fetch_job
+
+def _create_aip(**kwargs):
+    """Wrapper to slightly more easily return an AIP for conftest
+    fixtures.
+
+    param: ...
+    param: ...
+    param: ...
+    """
+    uuid_ = kwargs.get("uuid", str(uuid.uuid4()))
+    transfer_name = kwargs.get("transfer_name", "Test AIP")
+    create_date = kwargs.get("create_date", datetime.datetime.now())
+    storage_service_id = kwargs.get("storage_service_id", 1)
+    fetch_job_id = kwargs.get("fetch_job_id", 1)
+    return AIP(
+        uuid=uuid_,
+        transfer_name=transfer_name,
+        create_date=create_date,
+        storage_service_id=storage_service_id,
+        fetch_job_id=fetch_job_id,
+    )
+
+
+def _create_test_file(**kwargs):
+    """Wrapper to slightly more easily return a sample file for conftest
+    fixtures.
+
+    param: ...
+    param: ...
+    param: ...
+    """
+    name = kwargs.get("name", "file_name.ext")
+    filepath = kwargs.get("filepath", "/path/to/file_name.ext")
+    uuid_ = kwargs.get("uuid", str(uuid.uuid4()))
+    file_type = kwargs.get("file_type", FileType.original)
+    size = kwargs.get("size", "0")
+    date_created = kwargs.get("date_created", datetime.datetime.now())
+    puid = kwargs.get("puid", "fmt/test-1")
+    file_format = kwargs.get("file_format", "ACME File Format")
+    format_version = kwargs.get("format_version", "0.0.0")
+    checksum_type = kwargs.get("checksum_type", "test")
+    checksum_value = kwargs.get("checksum_value", "test")
+    aip_id = kwargs.get("aip_id", 1)
+    return File(
+        name=name,
+        filepath=filepath,
+        uuid=uuid_,
+        file_type=file_type,
+        size=size,
+        date_created=date_created,
+        puid=puid,
+        file_format=file_format,
+        format_version=format_version,
+        checksum_type=checksum_type,
+        checksum_value=checksum_value,
+        aip_id=aip_id,
+    )
+
+
 @pytest.fixture
 def app_with_populated_files(scope="package"):
     """Fixture with pre-populated data.
@@ -38,66 +155,30 @@ def app_with_populated_files(scope="package"):
     with app.app_context():
         db.create_all()
 
-        storage_service = StorageService(
-            name="test storage service",
-            url="http://example.com",
-            user_name="test",
-            api_key="test",
-            download_limit=20,
-            download_offset=10,
-            default=True,
-        )
-        db.session.add(storage_service)
-        db.session.commit()
-
-        fetch_job = FetchJob(
-            total_packages=1,
-            total_aips=1,
-            total_deleted_aips=0,
-            download_start=datetime.datetime.now(),
-            download_end=datetime.datetime.now(),
-            download_directory="/some/dir",
-            storage_service_id=storage_service.id,
-        )
-        db.session.add(fetch_job)
-        db.session.commit()
+        storage_service = _create_test_storage_service()
+        fetch_job = _create_fetch_job(storage_service_id=storage_service.id)
 
-        aip = AIP(
-            uuid="111111111111-1111-1111-11111111",
-            transfer_name="test aip",
-            create_date=datetime.datetime.now(),
+        aip = _create_aip(
             storage_service_id=storage_service.id,
             fetch_job_id=fetch_job.id,
         )
         db.session.add(aip)
         db.session.commit()
 
-        original_file = File(
-            name="original.tif",
-            filepath="/path/to/original.tif",
-            uuid="111111111111-1111-1111-11111111",
+        original_file = _create_test_file(
             file_type=FileType.original,
             size=ORIGINAL_FILE_SIZE,
-            date_created=datetime.datetime.now(),
             puid=TIFF_PUID,
             file_format=TIFF_FILE_FORMAT,
-            checksum_type="test",
-            checksum_value="test",
-            aip_id=1,
         )
-        preservation_file = File(
-            name="preservation.tif",
-            filepath="/path/to/preservation.tif",
-            uuid="222222222222-2222-2222-22222222",
+
+        preservation_file = _create_test_file(
             file_type=FileType.preservation,
             size=PRESERVATION_FILE_SIZE,
-            date_created=datetime.datetime.now(),
             puid=TIFF_PUID,
             file_format=TIFF_FILE_FORMAT,
-            checksum_type="test",
-            checksum_value="test",
-            aip_id=1,
         )
+
         db.session.add(original_file)
         db.session.add(preservation_file)
         db.session.commit()
@@ -114,45 +195,22 @@ def app_with_populated_format_versions(scope="package"):
     This fixture is used to create expected state which is then used to
     test the Data.format_versions_count endpoint.
     """
+    AIP_DATE_FORMAT = "%Y-%m-%d"
+
     app = create_app("test")
     with app.app_context():
         db.create_all()
 
-        storage_service = StorageService(
-            name="test storage service",
-            url="http://example.com",
-            user_name="test",
-            api_key="test",
-            download_limit=20,
-            download_offset=10,
-            default=True,
-        )
-        db.session.add(storage_service)
-        db.session.commit()
+        storage_service = _create_test_storage_service()
+        fetch_job = _create_fetch_job(storage_service_id=storage_service.id)
 
-        fetch_job = FetchJob(
-            total_packages=1,
-            total_aips=1,
-            total_deleted_aips=0,
-            download_start=datetime.datetime.now(),
-            download_end=datetime.datetime.now(),
-            download_directory="/some/dir",
-            storage_service_id=storage_service.id,
-        )
-        db.session.add(fetch_job)
-        db.session.commit()
-
-        aip1 = AIP(
-            uuid="111111111111-1111-1111-11111111",
-            transfer_name="test aip",
-            create_date=datetime.datetime.strptime(AIP_1_CREATION_DATE, "%Y-%m-%d"),
+        aip1 = _create_aip(
+            create_date=datetime.datetime.strptime(AIP_1_CREATION_DATE, AIP_DATE_FORMAT),
             storage_service_id=storage_service.id,
             fetch_job_id=fetch_job.id,
         )
-        aip2 = AIP(
-            uuid="222222222222-2222-2222-22222222",
-            transfer_name="test aip",
-            create_date=datetime.datetime.strptime(AIP_2_CREATION_DATE, "%Y-%m-%d"),
+        aip2 = _create_aip(
+            create_date=datetime.datetime.strptime(AIP_2_CREATION_DATE, AIP_DATE_FORMAT),
             storage_service_id=storage_service.id,
             fetch_job_id=fetch_job.id,
         )
@@ -160,34 +218,22 @@ def app_with_populated_format_versions(scope="package"):
         db.session.add(aip2)
         db.session.commit()
 
-        file1 = File(
-            name="jpeg_1.01.jpg",
-            filepath="/path/to/jpeg_1.01.jpg",
-            uuid="111111111111-1111-1111-11111111",
-            file_type=FileType.original,
+        file1 = _create_test_file(
             size=ORIGINAL_FILE_SIZE,
-            date_created=datetime.datetime.now(),
             puid=JPEG_1_01_PUID,
             file_format=JPEG_FILE_FORMAT,
             format_version=JPEG_1_01_FORMAT_VERSION,
-            checksum_type="test",
-            checksum_value="test",
             aip_id=aip1.id,
         )
-        file2 = File(
-            name="jpeg_1.02.jpg",
-            filepath="/path/to/jpeg_1.02.jpg",
-            uuid="222222222222-2222-2222-22222222",
-            file_type=FileType.original,
+
+        file2 = _create_test_file(
             size=PRESERVATION_FILE_SIZE,
-            date_created=datetime.datetime.now(),
             puid=JPEG_1_02_PUID,
             file_format=JPEG_FILE_FORMAT,
             format_version=JPEG_1_02_FORMAT_VERSION,
-            checksum_type="test",
-            checksum_value="test",
             aip_id=aip2.id,
         )
+
         db.session.add(file1)
         db.session.add(file2)
         db.session.commit()

The full script:

# -*- coding: utf-8 -*-

"""This module defines shared Data blueprint pytest fixtures."""

import datetime
import uuid

import pytest

from AIPscan import db, create_app
from AIPscan.models import StorageService, FetchJob, AIP, File, FileType


TIFF_FILE_FORMAT = "Tagged Image File Format"
TIFF_PUID = "fmt/353"

JPEG_FILE_FORMAT = "JPEG"

JPEG_1_01_FORMAT_VERSION = "1.01"
JPEG_1_01_PUID = "fmt/43"

JPEG_1_02_FORMAT_VERSION = "1.02"
JPEG_1_02_PUID = "fmt/44"

ORIGINAL_FILE_SIZE = 1000
PRESERVATION_FILE_SIZE = 2000

AIP_1_CREATION_DATE = "2020-01-01"
AIP_2_CREATION_DATE = "2020-06-01"


def _create_test_storage_service(**kwargs):
    """Wrapper to slightly more easily return a sample storage service
    for conftest fixtures.

    param: ...
    param: ...
    param: ...
    """
    name = kwargs.get("name", "test storage service")
    url = kwargs.get("name", "http://example.com")
    user_name = kwargs.get("user_name", "test")
    api_key = kwargs.get("api_key", "test")
    download_limit = kwargs.get("download_limit", 20)
    download_offset = kwargs.get("download_offset", 10)
    default = kwargs.get("default", True)
    storage_service = StorageService(
        name=name,
        url=url,
        user_name=user_name,
        api_key=api_key,
        download_limit=download_limit,
        download_offset=download_offset,
        default=default,
    )
    db.session.add(storage_service)
    db.session.commit()
    return storage_service


def _create_fetch_job(**kwargs):
    """Wrapper to slightly more easily return a sample fetch job for
    conftest fixtures.

    param: ...
    param: ...
    param: ...
    """
    total_packages = kwargs.get("total_packages", 1)
    total_aips = kwargs.get("total_aips", 1)
    total_deleted_aips = kwargs.get("total_deleted_aips", 0)
    download_start = kwargs.get("download_start", datetime.datetime.now())
    download_end = kwargs.get("download_end", datetime.datetime.now())
    download_directory = kwargs.get("download_directory", "/some/directory/")
    storage_service_id = kwargs.get("storage_service_id", 1)
    fetch_job = FetchJob(
        total_packages=total_packages,
        total_aips=total_aips,
        total_deleted_aips=total_deleted_aips,
        download_start=download_start,
        download_end=download_end,
        download_directory=download_directory,
        storage_service_id=storage_service_id,
    )
    db.session.add(fetch_job)
    db.session.commit()
    return fetch_job


def _create_aip(**kwargs):
    """Wrapper to slightly more easily return an AIP for conftest
    fixtures.

    param: ...
    param: ...
    param: ...
    """
    uuid_ = kwargs.get("uuid", str(uuid.uuid4()))
    transfer_name = kwargs.get("transfer_name", "Test AIP")
    create_date = kwargs.get("create_date", datetime.datetime.now())
    storage_service_id = kwargs.get("storage_service_id", 1)
    fetch_job_id = kwargs.get("fetch_job_id", 1)
    return AIP(
        uuid=uuid_,
        transfer_name=transfer_name,
        create_date=create_date,
        storage_service_id=storage_service_id,
        fetch_job_id=fetch_job_id,
    )


def _create_test_file(**kwargs):
    """Wrapper to slightly more easily return a sample file for conftest
    fixtures.

    param: ...
    param: ...
    param: ...
    """
    name = kwargs.get("name", "file_name.ext")
    filepath = kwargs.get("filepath", "/path/to/file_name.ext")
    uuid_ = kwargs.get("uuid", str(uuid.uuid4()))
    file_type = kwargs.get("file_type", FileType.original)
    size = kwargs.get("size", "0")
    date_created = kwargs.get("date_created", datetime.datetime.now())
    puid = kwargs.get("puid", "fmt/test-1")
    file_format = kwargs.get("file_format", "ACME File Format")
    format_version = kwargs.get("format_version", "0.0.0")
    checksum_type = kwargs.get("checksum_type", "test")
    checksum_value = kwargs.get("checksum_value", "test")
    aip_id = kwargs.get("aip_id", 1)
    return File(
        name=name,
        filepath=filepath,
        uuid=uuid_,
        file_type=file_type,
        size=size,
        date_created=date_created,
        puid=puid,
        file_format=file_format,
        format_version=format_version,
        checksum_type=checksum_type,
        checksum_value=checksum_value,
        aip_id=aip_id,
    )


@pytest.fixture
def app_with_populated_files(scope="package"):
    """Fixture with pre-populated data.

    This fixture is used to create expected state which is then used to
    test the Data.aips_by_file_format and Data.aips_by_puid endpoints.
    """
    app = create_app("test")
    with app.app_context():
        db.create_all()

        storage_service = _create_test_storage_service()
        fetch_job = _create_fetch_job(storage_service_id=storage_service.id)

        aip = _create_aip(
            storage_service_id=storage_service.id,
            fetch_job_id=fetch_job.id,
        )
        db.session.add(aip)
        db.session.commit()

        original_file = _create_test_file(
            file_type=FileType.original,
            size=ORIGINAL_FILE_SIZE,
            puid=TIFF_PUID,
            file_format=TIFF_FILE_FORMAT,
        )

        preservation_file = _create_test_file(
            file_type=FileType.preservation,
            size=PRESERVATION_FILE_SIZE,
            puid=TIFF_PUID,
            file_format=TIFF_FILE_FORMAT,
        )

        db.session.add(original_file)
        db.session.add(preservation_file)
        db.session.commit()

        yield app

        db.drop_all()


@pytest.fixture
def app_with_populated_format_versions(scope="package"):
    """Fixture with pre-populated data.

    This fixture is used to create expected state which is then used to
    test the Data.format_versions_count endpoint.
    """
    AIP_DATE_FORMAT = "%Y-%m-%d"

    app = create_app("test")
    with app.app_context():
        db.create_all()

        storage_service = _create_test_storage_service()
        fetch_job = _create_fetch_job(storage_service_id=storage_service.id)

        aip1 = _create_aip(
            create_date=datetime.datetime.strptime(
                AIP_1_CREATION_DATE, AIP_DATE_FORMAT
            ),
            storage_service_id=storage_service.id,
            fetch_job_id=fetch_job.id,
        )
        aip2 = _create_aip(
            create_date=datetime.datetime.strptime(
                AIP_2_CREATION_DATE, AIP_DATE_FORMAT
            ),
            storage_service_id=storage_service.id,
            fetch_job_id=fetch_job.id,
        )
        db.session.add(aip1)
        db.session.add(aip2)
        db.session.commit()

        file1 = _create_test_file(
            size=ORIGINAL_FILE_SIZE,
            puid=JPEG_1_01_PUID,
            file_format=JPEG_FILE_FORMAT,
            format_version=JPEG_1_01_FORMAT_VERSION,
            aip_id=aip1.id,
        )

        file2 = _create_test_file(
            size=PRESERVATION_FILE_SIZE,
            puid=JPEG_1_02_PUID,
            file_format=JPEG_FILE_FORMAT,
            format_version=JPEG_1_02_FORMAT_VERSION,
            aip_id=aip2.id,
        )

        db.session.add(file1)
        db.session.add(file2)
        db.session.commit()

        yield app

        db.drop_all()

It reduces the number of lines a little and helps with sensible defaults.

There are a few gaps in that I didn't have a position on adding param: for these. Also, I think the rebase will affect it slightly but the overall functions should remain the same.

We could also make this a different future commit if you like?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I like this a lot @ross-spencer. Thanks for putting in a lot of effort and good thinking here! I'll do my best to get this into my revisions but appreciate the option to kick it to a future commit/PR if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Helpers added in commit dbe8350.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! The changes to conftest.py are looking good!

DATE_AFTER_AIP_2 = "2020-06-02"


class MockFormatVersionsCountQueryResult:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the sophistication here.

The approach also reminds me of named tuples which I use occasionally. They're a little more lightweight. But I've no opinion on which is best here. Just sharing for reference.

I use them in my Palettes work for the painter goblin: https://github.com/ross-spencer/palettes/blob/9a422fa8aeed79995b2395c87278d8d55cf71787/palettes.py#L13

And here's a link to "module of the week" for more context: https://pymotw.com/2/collections/namedtuple.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting, this is new to me. Thanks for sharing!

version_count,
total_file_count,
total_file_size,
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this looks good! 👍

// this into it's own import too.
//
// TODO: Are there more idiomatic approaches? 🤷
bootstrap_alert = function(message, alertType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see these removed!

var startdate = $('#startdate3').val();
if (startdate != "") {
if (startdate > date) {
alert("The start date must be before the end date");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the two alerts here are supposed to have the same string we can just remove the duplication with a const/var above? const DATE_ALERT = "...";.

My reason for asking if they are supposed, then I wonder if the end date alert might otherwise say, "the end date must be after the start date"? I have no preference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added two new consts here - DATE_ALERT_START and DATE_ALERT_END - for the two variations of the message

@@ -54,8 +51,8 @@
<tr>
<td>Format version count</td>
<td></td>
<td></td>
<td></td>
<td><input type="text" id="startdate3" value="{{ start_date }}"></td>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been using integers as suffixes a little across this page, but do you think we need to think a little further ahead and perhaps name attributes more meaningfully, e.g. id="start_date_version_report" or something along those lines? Same for end date?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point!

columns=translate_headers(headers),
versions=versions,
total_file_count=sum(version.get(data.FIELD_COUNT, 0) for version in versions),
total_size=sum(version.get(data.FIELD_SIZE, 0) for version in versions),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting a stack trace error here:

Traceback (most recent call last):
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask_restx/api.py", line 639, in error_router
    return original_handler(e)
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/ross-spencer/git/artefactual-labs/AIPscan/AIPscan/Reporter/report_format_versions_count.py", line 37, in report_format_versions_count
    total_size=sum(version.get(data.FIELD_SIZE, 0) for version in versions),

It's not coming to me tonight why I might be getting None but I am wondering if you have an instinct? We might also need to capture that in tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add the check for None in the Data endpoint here, but agree that it would be good to capture this in tests! I will work on including that in my revisions.

<td>{{ version["Format"] }}</td>
<td>{{ version["Version"] }}</td>
<td>{{ version["Count"] }}</td>
<td>{{ version["Size"] | filesizeformat }}</td>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be related to the reporter module code, but here, I have a {'PUID': 'fmt/468', 'Format': 'ISO Disk Image File', 'Version': None, 'Count': 1, 'Size': None} result and the template expects a number to work with and not None so I wonder how we can fortify this or the data coming out of the DB (or into) which can help us to avoid this? I looked at the original METS and it looks like the ISO Disk Image File was ingested as a standard transfer from one of my FPR test sets - but it's likely the image from the sample data dirs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's likely that this File object has the format and size it does because of this potentially misleading AttributeError handling:

except AttributeError:
# File error/warning to log. Obviously this format may
# be incorrect so it is our best guess.
file_info["file_format"] = "ISO Disk Image File"
file_info["puid"] = "fmt/468"
(see also issue #83). I wonder if it's actually an ISO disk image...

Regardless, I think you hit the nail on the head that we should be checking for None in the data endpoint to avoid this issue. Good spot!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah whoops, just re-read your comment and my "I wonder if it's actually an ISO disk image..." was off base here. Sorry about that! 😅

except AttributeError:
pass
version_info[FIELD_COUNT] = version.file_count
version_info[FIELD_SIZE] = version.total_size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it looks like maybe we can test for none here? (see template comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think you're right!

:start_date: Optional inclusive AIP creation start date
(datetime.datetime object).
:end_date: Optional inclusive AIP creation end date
(datetime.datetime object).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(datetime.datetime object).
:param storage_service_id: Storage Service ID.
:start_date: Optional inclusive AIP creation start date
(datetime.datetime object).
:end_date: Optional inclusive AIP creation end date
(datetime.datetime object).

I am still getting used to this more complex style of docstring. Admittedly I haven't spent a lot of time looking at this, but what do you think about adding a hanging indent: https://stackoverflow.com/a/50455731 - to help readability? It would be good to do that wherever we break a docstring over two lines. Initially it's hard to pick up as part of the same string.

Is there another idiomatic approach that you happen to know about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These detailed docstrings are a new-ish habit for me so I don't have a ready answer for other approaches but this is a good suggestion and I agree it would help a lot with readability.

@tw4l
Copy link
Contributor Author

tw4l commented Dec 17, 2020

Hi @ross-spencer - I believe I've responded to each of your suggestions and comments and this PR is now ready for code review. Thanks again for the great suggestions here!

In addition to responding to the in-line comments, I've rebased this branch on main and moved the report-specific endpoints to a new Data.report_data module and API.namespace_report_data namespace. This required a lot of changes in the Data, API, and Reporter blueprints and updates to the tests, so the diff might be a bit large. I've made sure that all tests are passing and done a quick manual spot check of each of the API endpoints and UI reports to make sure everything looks good following the change.

Along the way I discovered a bug in the agents_transfers endpoint that sometimes cause that report and API endpoint to raise an AttributeError, which I filed an issue for here: #104.

@tw4l tw4l requested a review from ross-spencer December 17, 2020 02:23
@tw4l
Copy link
Contributor Author

tw4l commented Dec 17, 2020

I realized right after marking this PR as ready for re-review that I forgot to update the "File format count" report to use AIP.create_date for date filtering, so I've added a follow-up commit with that change.

Copy link
Contributor

@ross-spencer ross-spencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question around style/presentation for the API URL, but approving, and:

santa

The way you approached moving things about made it really simple to re-review. Lots of this looking really clean and easy to use now too. Thanks @tw4l


This fixture is used to create expected state which is then used to
test the Data.format_versions_count endpoint.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! The changes to conftest.py are looking good!

data.FIELD_STORAGE_NAME: "Storage Service Name",
data.FIELD_TRANSFER_NAME: "Transfer Name",
data.FIELD_VERSION: "Version",
fields.FIELD_AIP_NAME: "AIP Name",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fields module is a nice intuitive change it feels much easier to make use of now.

from AIPscan.Data import report_data

api = Namespace(
"report_data", description="Retrieve data optimized for AIPscan reports"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder which is nicer to have here:

With underscore: http://127.0.0.1:5000/api/report_data/format-version-count/1
With dash: http://127.0.0.1:5000/api/report-data/format-version-count/1

It looks like both are valid - my preference is with the dash, because we have dashes in the endpoint name too. IDK what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch! I agree the dash is nicer and more consistent. I'll make that change before merging.

This commit also:

- Adds a new Data.report_date module and report-data API namespace
for data endpoints optimized for reports implemented in the AIPscan UI
as opposed to more general data overview endpoints.
- Modifies the file format count report to filter dates on
AIP.create_date to bring the two reports into line with each other.
- Adds new helper functions for creating test objects in the Data.test
package's conftest module.
@tw4l tw4l force-pushed the dev/issue-93-format-version-count branch from cb05db4 to e73fc58 Compare December 17, 2020 19:06
@tw4l tw4l merged commit e73fc58 into main Dec 17, 2020
@tw4l tw4l deleted the dev/issue-93-format-version-count branch December 17, 2020 19:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants