[release-2.8] add the document for manually running the cronjob (#533)
* add the document for manually executing the job

* format the doc

* apply review updates

* add link to the troubleshooting doc

---------

Signed-off-by: myan <[email protected]>
Co-authored-by: myan <[email protected]>
openshift-cherrypick-robot and yanmxa authored Jul 10, 2023
1 parent c7858a1 commit c35b08f
Showing 2 changed files with 123 additions and 1 deletion.
120 changes: 119 additions & 1 deletion doc/how_global_hub_works.md
@@ -47,4 +47,122 @@ In the long run, the desired state is full compliance. If daily variations as ca
Note:
- multicluster global hub operator controls the life cycle of the multicluster global hub manager and global hub agent
- kafka and database can run on the Global cluster or outside it
- ACM hub also runs on the multicluster global hub cluster, but does not participate in the daily functioning of the global hub.

### Running the Summarization Process manually

Before starting, note that the nightly summary consists of two subtasks:
- Insert that day's cluster policy data from the [Materialized View](https://www.postgresql.org/docs/current/rules-materializedviews.html) `local_compliance_view_<yyyy_MM_dd>` into `history.local_compliance`.
- Update the `compliance` status and the policy flip `frequency` for that day in `history.local_compliance`, based on `event.local_policies`.

#### Execution steps

1. Connect to the database

You can use clients such as pgAdmin, TablePlus, etc., to connect to the Global Hub database and execute the SQL statements in the following steps. If your PostgreSQL database was installed through [this script](../operator/config/samples/storage/deploy_postgres.sh), you can connect to the database on the cluster directly with the following command.
```bash
kubectl exec -it $(kubectl get pods -n hoh-postgres -l postgres-operator.crunchydata.com/role=master -o jsonpath='{.items..metadata.name}') -n hoh-postgres -c database -- psql -d hoh
```
2. Determine the date that needs to be reprocessed, taking `2023-07-06` as an example

   If the dashboard shows no compliance information for `2023-07-06`, look for the job failure record for the following day, `2023-07-07`, in `history.local_compliance_job_log`. This identifies `2023-07-06` as the date for which the summary process must be run manually.
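   The failure record can be found by querying the log table directly; since the exact columns of `history.local_compliance_job_log` may vary by schema version, a plain `SELECT *` is the safest way to inspect it:
   ```sql
   -- inspect recent summarization runs and their error details
   SELECT * FROM history.local_compliance_job_log;
   ```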

3. Check whether the Materialized View `history.local_compliance_view_2023_07_06` exists
```sql
select * from history.local_compliance_view_2023_07_06;
```
- If the view exists, load the view records into `history.local_compliance`
```sql
CREATE OR REPLACE FUNCTION history.insert_local_compliance_job(
view_date text
)
RETURNS void AS $$
BEGIN
EXECUTE format('
INSERT INTO history.local_compliance (policy_id, cluster_id, leaf_hub_name, compliance, compliance_date)
(
SELECT policy_id, cluster_id, leaf_hub_name, compliance, %2$L
FROM history.local_compliance_view_%1$s
ORDER BY policy_id, cluster_id
)
ON CONFLICT (policy_id, cluster_id, compliance_date) DO NOTHING',
view_date, view_date);
END;
$$ LANGUAGE plpgsql;
-- exec the insert func for that day '2023_07_06'
SELECT history.insert_local_compliance_job('2023_07_06');
```

- If the view does not exist, inherit the history compliance records of the previous day, `2023_07_05`
```sql
CREATE OR REPLACE PROCEDURE history.inherit_local_compliance_job(
prev_date TEXT,
curr_date TEXT
)
LANGUAGE plpgsql
AS $$
BEGIN
EXECUTE format('
INSERT INTO history.local_compliance (policy_id, cluster_id, leaf_hub_name, compliance_date, compliance, compliance_changed_frequency)
SELECT
policy_id,
cluster_id,
leaf_hub_name,
%1$L,
compliance,
compliance_changed_frequency
FROM
history.local_compliance
WHERE
compliance_date = %2$L
ON CONFLICT (policy_id, cluster_id, compliance_date) DO NOTHING
', curr_date, prev_date);
END;
$$;
-- call the procedure to inherit the data of '2023_07_05' and generate the records for '2023_07_06'
CALL history.inherit_local_compliance_job('2023_07_05', '2023_07_06');
```

4. Update the `compliance` and `frequency` information of that day to `history.local_compliance`
```sql
CREATE OR REPLACE FUNCTION history.update_local_compliance_job(start_date_param text, end_date_param text)
RETURNS void AS $$
BEGIN
EXECUTE format('
INSERT INTO history.local_compliance (policy_id, cluster_id, leaf_hub_name, compliance_date, compliance, compliance_changed_frequency)
WITH compliance_aggregate AS (
SELECT cluster_id, policy_id, leaf_hub_name,
CASE
WHEN bool_and(compliance = ''compliant'') THEN ''compliant''
ELSE ''non_compliant''
END::local_status.compliance_type AS aggregated_compliance
FROM event.local_policies
WHERE created_at BETWEEN %1$L::date AND %2$L::date
GROUP BY cluster_id, policy_id, leaf_hub_name
)
SELECT policy_id, cluster_id, leaf_hub_name, %1$L, aggregated_compliance,
(SELECT COUNT(*) FROM (
SELECT created_at, compliance,
LAG(compliance) OVER (PARTITION BY cluster_id, policy_id ORDER BY created_at ASC) AS prev_compliance
FROM event.local_policies lp
WHERE (lp.created_at BETWEEN %1$L::date AND %2$L::date)
AND lp.cluster_id = ca.cluster_id AND lp.policy_id = ca.policy_id
ORDER BY created_at ASC
) AS subquery WHERE compliance <> prev_compliance) AS compliance_changed_frequency
FROM compliance_aggregate ca
ORDER BY cluster_id, policy_id
ON CONFLICT (policy_id, cluster_id, compliance_date)
DO UPDATE SET
compliance = EXCLUDED.compliance,
compliance_changed_frequency = EXCLUDED.compliance_changed_frequency',
start_date_param, end_date_param);
END;
$$ LANGUAGE plpgsql;
-- call the function to update records starting at '2023_07_06' and ending at '2023_07_07'
SELECT history.update_local_compliance_job('2023_07_06', '2023_07_07');
```
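   For intuition, the flip count computed by the `LAG` window above is simply the number of adjacent compliance changes in a cluster/policy's chronological event list. A minimal Python sketch of that logic (illustrative only, not part of the recovery procedure):
   ```python
   def compliance_changed_frequency(events):
       """Count adjacent compliance flips, mirroring the LAG(...) comparison in SQL.

       events: iterable of (created_at, compliance) tuples for one cluster/policy pair.
       """
       # sort by created_at, then compare each status with its predecessor
       statuses = [compliance for _, compliance in sorted(events)]
       return sum(1 for prev, curr in zip(statuses, statuses[1:]) if prev != curr)

   events = [
       ("2023-07-06T01:00", "compliant"),
       ("2023-07-06T05:00", "non_compliant"),
       ("2023-07-06T09:00", "non_compliant"),
       ("2023-07-06T13:00", "compliant"),
   ]
   print(compliance_changed_frequency(events))  # 2 flips
   ```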
5. Once the above steps complete successfully, you will find that day's records in `history.local_compliance`, and you can safely delete the Materialized View `history.local_compliance_view_2023_07_06`.
```sql
DROP MATERIALIZED VIEW IF EXISTS history.local_compliance_view_2023_07_06;
```
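   Before dropping the view, a quick count can confirm that the backfill landed (assuming `compliance_date` is a standard `date` column, as the functions above suggest):
   ```sql
   -- expect a non-zero count for the recovered day
   SELECT count(*) FROM history.local_compliance WHERE compliance_date = '2023-07-06';
   ```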
4 changes: 4 additions & 0 deletions doc/troubleshooting.md
@@ -113,3 +113,7 @@ Whereas `pg_restore` is used to restore a PostgreSQL database from an archive cr
```shell
pg_restore -h another.host.com -p 5432 -U postgres -d hoh hoh-$(date +%d-%m-%y_%H-%M).tar
```

## Cronjob Compliance Data Restore
### Generate the missed compliance data
The Grafana datasource is mainly the table `history.local_compliance`, whose records are generated by a summarization routine that kicks off at 00:00:00 every night. Usually there is no need to run the summary manually, but if the compliance job fails unexpectedly, you can log in to the database and run the whole summary process by hand to recover the missing data. Follow [Running the Summarization Process manually](./how_global_hub_works.md#running-the-summarization-process-manually) to do so.
