Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate New User Statistics into Profile Retrieval #153

Merged
merged 1 commit into from
Dec 22, 2024

Conversation

TeachMeTW
Copy link
Contributor

@TeachMeTW TeachMeTW commented Nov 8, 2024

Overview

Enhances the user profile data retrieval process by incorporating newly stored statistics (pipeline_range, total_trips, labeled_trips, and last_call) into the existing with block. These statistics are now seamlessly integrated into the user dictionary, providing a more comprehensive set of user metrics for further analysis and dashboard visualization.

Changes Made

1. Enhanced Profile Data Retrieval

  • Existing Functionality:

    • The with block previously retrieved basic profile information such as platform, manufacturer, app_version, os_version, and phone_lang from the database and assigned them to the user dictionary.
  • New Enhancements:

    • Pipeline Range:
      • Retrieved pipeline_range data, including start_ts and end_ts, to understand the timeframe of the user's activity pipeline.
    • Trip Counts:
      • Extracted total_trips and labeled_trips to provide insights into the user's trip data and the extent of trip labeling.
    • Last API Call Timestamp:
      • Obtained last_call to track the most recent API interaction by the user.
    • Optional Formatting:
      • Included a formatted version of the last_call timestamp for better readability in reports and dashboards.

2. Updated with Block Structure

@TeachMeTW
Copy link
Contributor Author

@JGreenlee Please review

Copy link
Contributor

@JGreenlee JGreenlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good – I have a suggestion

utils/db_utils.py Outdated Show resolved Hide resolved
@JGreenlee
Copy link
Contributor

I didn't do a great job of explaining what I was looking for since I wrote it out on my phone. Here's what I meant by my last suggestion:

                # Fetch data for the user, cached for repeated queries
                if not profile_data:
                    profile_data = {}
                
                # Assign existing profile attributes to the user dictionary
                user['platform'] = profile_data.get('curr_platform')
                user['manufacturer'] = profile_data.get('manufacturer')
                user['app_version'] = profile_data.get('client_app_version')
                user['os_version'] = profile_data.get('client_os_version')
                user['phone_lang'] = profile_data.get('phone_lang')

                
                # Assign newly stored statistics to the user dictionary
                user['total_trips'] = profile_data.get('total_trips', 0)
                user['labeled_trips'] = profile_data.get('labeled_trips', 0)
                
                # Retrieve and assign pipeline range
                pipeline_range = profile_data.get('pipeline_range', {})
                user['pipeline_start_ts'] = pipeline_range.get('start_ts')
                user['pipeline_end_ts'] = pipeline_range.get('end_ts')
                
                # Retrieve and assign last API call timestamp
                user['last_call'] = profile_data.get('last_call')
                    
                # Optional: Format the last_call timestamp for readability
                if user['last_call']:
                    user['formatted_last_call'] = arrow.get(user['last_call']).format('YYYY-MM-DD HH:mm:ss')
                else:
                    user['formatted_last_call'] = None
                

If profile_data is falsy/None, it gets set to {} before we attempt to access its properties.

This allows the .get() calls to safely handle defaults, and we don't have to do all of this inside an if block.

@TeachMeTW
Copy link
Contributor Author

@JGreenlee Addressed comment

Copy link
Contributor

@JGreenlee JGreenlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my review of e-mission/e-mission-server#993, the code looks good but I am not sure it has been tested locally.

@JGreenlee
Copy link
Contributor

With this PR and e-mission/e-mission-server#993, I ran the pipeline on one fresh test user.
I used shankari_2016-07-22 and shankari_2016-07-25 giving 6 trips.

image

There are blank fields here, which makes sense because I did not yet access the diary.


I used the Simulator to run e-mission-phone and sign into this user. After refreshing the dashboard I see this:

image

last_call is still blank. Also, there are several missing columns. Where are pipeline_start_ts, pipeline_end_ts, and formatted_last_call ?

@JGreenlee
Copy link
Contributor

valid_uuids_cols in constants.py controls what columns of the df get shown in the table:

valid_uuids_columns = [
'user_token',
'user_id',
'update_ts',
'total_trips',
'labeled_trips',
'first_trip',
'last_trip',
'last_call',
'platform',
'manufacturer',
'app_version',
'os_version',
'phone_lang',
]

Either use the original column names or update the valid cols here.

(side note, maybe rename to VALID_UUIDS_COLS to match the other constants in constants.py)

@JGreenlee
Copy link
Contributor

After accessing the app via the Simulator, I ran the pipeline again trying to get last_call to populate. It was still blank.


To debug I looked at pipeline logs to see if it was being recorded. Searching for "last_call" in intake_single.log, here's what I found:

2024-12-04 12:31:17,297:DEBUG:140704732739904:Last call timestamp: 1733333418.8644428
2024-12-04 12:31:17,297:INFO:140704732739904:Last call timestamp: 1733333418.8644428
2024-12-04 12:31:17,299:DEBUG:140704732739904:Updating user 4fea8afc-f4b1-4b3a-af09-d4bf7ef9e7e3 with fields {'pipeline_range': {'start_ts': 1469204768.0, 'end_ts': 1469493031.0}, 'total_trips': 6, 'labeled_trips': 0, 'last_call_ts': 1733333418.8644428}
2024-12-04 12:31:17,300:DEBUG:140704732739904:User profile updated with data: {'pipeline_range': {'start_ts': 1469204768.0, 'end_ts': 1469493031.0}, 'total_trips': 6, 'labeled_trips': 0, 'last_call_ts': 1733333418.8644428}

It is recorded as last_call_ts but the dashboard is looking for last_call

@JGreenlee
Copy link
Contributor

recommended changes

diff --git a/utils/db_utils.py b/utils/db_utils.py
index 3c0af03..944638f 100644
--- a/utils/db_utils.py
+++ b/utils/db_utils.py
@@ -448,24 +448,23 @@ def add_user_stats(user_data, batch_size=5):
                 user['os_version'] = profile_data.get('client_os_version')
                 user['phone_lang'] = profile_data.get('phone_lang')
 
-                
                 # Assign newly stored statistics to the user dictionary
-                user['total_trips'] = profile_data.get('total_trips', 0)
-                user['labeled_trips'] = profile_data.get('labeled_trips', 0)
+                user['total_trips'] = profile_data.get('total_trips')
+                user['labeled_trips'] = profile_data.get('labeled_trips')
                 
                 # Retrieve and assign pipeline range
                 pipeline_range = profile_data.get('pipeline_range', {})
-                user['pipeline_start_ts'] = pipeline_range.get('start_ts')
-                user['pipeline_end_ts'] = pipeline_range.get('end_ts')
-                
+                start_ts = pipeline_range.get('start_ts')
+                end_ts = pipeline_range.get('end_ts')
+                if start_ts:
+                    user['first_trip'] = arrow.get(start_ts).format(time_format)
+                if end_ts:
+                    user['last_trip'] = arrow.get(end_ts).format(time_format)
+
                 # Retrieve and assign last API call timestamp
-                user['last_call'] = profile_data.get('last_call')
-                
-                # Optional: Format the last_call timestamp for readability
-                if user['last_call']:
-                    user['formatted_last_call'] = arrow.get(user['last_call']).format('YYYY-MM-DD HH:mm:ss')
-                else:
-                    user['formatted_last_call'] = None
+                last_call_ts = profile_data.get('last_call_ts')
+                if last_call_ts:
+                    user['last_call'] = arrow.get(last_call_ts).format('YYYY-MM-DD')
                 
             esdsq.store_dashboard_time(
                 "admin/db_utils/add_user_stats/process_user",
image

All cols are now filled in for the first user. I also tested a second user who has no data; all cols are blank.

@JGreenlee
Copy link
Contributor

JGreenlee commented Dec 4, 2024

Note that I formatted last_call as YYYY-MM-DD (date only, no time). If you recall, we acknowledged that this value may lag behind by a bit because the pipeline runs hourly, not updated in real time. We decided date-only is good enough because the use case here is just to determine if users are active or not.

Also note that I removed some default values, such as 0 for total_trips and labeled_trips. Think about it – there is a difference between:
i) the pipeline ran and recorded 0 labeled trips in the user stats
ii) the pipeline has not run and has not recorded any value

In scenario (i) we'd want to show the 0 value but in scenario (ii) I'd argue that a blank cell is a better choice (which is what happens if we allow it to be None)

@TeachMeTW
Copy link
Contributor Author

Tested via

./e-mission-py.bash bin/debug/load_timeline_for_day_and_user.py emission/tests/data/real_examples/shankari_2016-07-25 nrelop_dev-emulator-study_2
./e-mission-py.bash bin/debug/intake_single_user.py -e nrelop_dev-emulator-study_2
 
./e-mission-py.bash bin/debug/purge_user.py -e nrelop_dev-emulator-study_2```

On stm_ebike dump which had no uuids originally
 

@TeachMeTW
Copy link
Contributor Author

@JGreenlee Ready for rereview

Copy link
Contributor

@JGreenlee JGreenlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but depends on e-mission/e-mission-server#993 to be deployed at the same time

@shankari
Copy link
Contributor

@TeachMeTW the change is pretty straightforward, but the commit history is not.
I am going to rebase now, but please make sure to tidy up the commit history in all repos

Here's the first commit in the history here, for example
Screenshot 2024-12-21 at 5 15 56 PM

**Overview**

Enhances the user profile data retrieval process by incorporating newly
stored statistics (`pipeline_range`, `total_trips`, `labeled_trips`, and
`last_call`) into the existing `with` block. These statistics are now
seamlessly integrated into the `user` dictionary, providing a more
comprehensive set of user metrics for further analysis and dashboard
visualization.

1. **Enhanced Profile Data Retrieval**

- **Existing Functionality**:
  - The `with` block previously retrieved basic profile information such as `platform`, `manufacturer`, `app_version`, `os_version`, and `phone_lang` from the database and assigned them to the `user` dictionary.

- **New Enhancements**:
  - **Pipeline Range**:
    - Retrieved `pipeline_range` data, including `start_ts` and `end_ts`, to understand the timeframe of the user's activity pipeline.
  - **Trip Counts**:
    - Extracted `total_trips` and `labeled_trips` to provide insights into the user's trip data and the extent of trip labeling.
  - **Last API Call Timestamp**:
    - Obtained `last_call` to track the most recent API interaction by the user.
  - **Optional Formatting**:
    - Included a formatted version of the `last_call` timestamp for better readability in reports and dashboards.

2. **Updated `with` Block Structure**

- **Integration of New Statistics**:
  - The `with` block now not only fetches and assigns basic profile attributes but also retrieves and assigns the newly stored statistics from the database.

Additional commits squashed into this:
- Accidentally removed the process loop
- Simplified else block for user data
- Merge
- Addressed comment, modified else if block
- Updated COLS to be uniform like the other constants. Updated ADD USER STATS
@shankari shankari force-pushed the Refactor-User-Stats branch from fd32ba1 to 186fc73 Compare December 22, 2024 02:05
@shankari
Copy link
Contributor

shankari commented Dec 22, 2024

The change was fairly complex to rebase because the first change commented out a bunch of code
#153 (comment)
which was then reinstated in a different way in d241d37
and the first change here was committed before d241d37

Since the commits here were interspersed with commits that had already been merged, we needed to reorder before we could squash. But reordering wouldn't work because the base for the first commit here was the old timing, and so applying the change resulted in a merge conflict.

@TeachMeTW you really need to use have cleaner commits instead of pushing changes whenever you make any temporary change.

In the end, the changes were simple enough that I just generated a diff, and used patch to apply it as a single commit which I then force-pushed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Tasks completed
Development

Successfully merging this pull request may close these issues.

3 participants