Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tandem duplications #100

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Tandem duplications #100

wants to merge 4 commits into from

Conversation

hyanwong
Copy link
Owner

@hyanwong hyanwong commented Mar 7, 2024

WIP. You can test by running the following code:

from collections import Counter

from tests.test_gigutil import TestDTWF_one_break_no_rec_inversions_slow

test = TestDTWF_one_break_no_rec_inversions_slow()
test.progress_monitor = True
test.default_gens = 40  # increase number of gens for more consistent profiling
print(f"Duplication: {test.duplication} (size = {test.duplication.span} bp)")
test.test_tandem_duplication()
gig = test.gig.sample_resolve()
sample_lengths = [
    (gig.tables.iedges.child_right[gig.tables.iedges.child == s]).max()
    for s in gig.samples
]
c = Counter(sample_lengths)
print(c)
print("num edges", len(test.gig.iedges), ". Simplified to", len(gig.iedges))

For me, this takes about 10 mins, and results in a list of genome sizes that have lengths that differ by multiples of 450bp (the duplication size)

Duplication: Interval(left=50, right=500) (size = 450 bp)
Simulating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 158.86gen/s]
Simulating more: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 39/39 [10:36<00:00, 16.32s/gen]
Counter({531: 96, 981: 64, 1431: 22, 1881: 11, 2331: 4, 2781: 2, 3231: 1})
num edges 15591 . Simplified to 8860

Summary by CodeRabbit

  • New Features
    • Added progress monitoring to simulation runs, enhancing user feedback during operations.
  • Tests
    • Expanded test coverage to include new simulation features and progress monitoring.
  • Chores
    • Updated project dependencies to include the tqdm package for progress tracking.

Copy link
Contributor

coderabbitai bot commented Mar 7, 2024

Walkthrough

This update enhances a genetic simulation utility, focusing on progress monitoring during simulations and supporting structural variations (SVs) like tandem duplications. It improves user experience by integrating tqdm for progress tracking and refines the simulation's flexibility and accuracy by adjusting parameters related to validation and SV handling.

Changes

Files Change Summary
tests/gigutil.py Added tqdm.auto import, replaced use_validation with skip_validate, added progress_monitor.
tests/test_gigutil.py Added constants and attributes for duplication and validation, updated method calls for progress monitoring and SV handling.
GeneticInheritanceGraphLibrary/tables.py Updated field type definitions for non-integer fields to enhance space efficiency.
tests/test_tables.py Added a test method for adding edges with negative chromosome numbers, included assertions for negative values.
requirements.txt Added tqdm package.

Related issues

Poem

In fields of code where logic hops,
🐇 A rabbit toils by lines and ops.
With progress bars that smoothly slide,
And duplications far and wide.
Through loops and functions, swift it leaps,
In binary burrows, deep it peeps.
"To simulate," it softly sings,
"Is to give data wings." 🚀

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit-tests for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit tests for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit tests.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

CodeRabbit Discord Community

Join our Discord Community to get help, request features, and share feedback.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between d726e99 and 85951bb.
Files selected for processing (3)
  • tests/gigutil.py (8 hunks)
  • tests/test_gigutil.py (9 hunks)
  • tests/test_graph.py (1 hunks)
Additional comments: 13
tests/test_graph.py (2)
  • 253-258: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [1-1]

The removal of the print(iedge_rows) statement from the test_all_svs_re_sample_resolve method streamlines the output of the test case related to iedge_rows verification. This change enhances the cleanliness of the test output, making it easier to identify relevant test results without the clutter of unnecessary print statements.

  • 253-258: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [1-1]

Overall, the test code in tests/test_graph.py is well-structured and covers a wide range of scenarios for testing the GeneticInheritanceGraphLibrary. The classes and methods are logically organized, and the tests appear to be thorough, ensuring that various aspects of the library are correctly validated. It's important to continue maintaining this level of detail and coverage as the library evolves.

tests/gigutil.py (7)
  • 3-3: The import statement for tqdm.auto is correctly added to support progress monitoring during simulations. This is in line with the PR objectives to enhance the testing framework with progress tracking capabilities.
  • 165-165: The addition of the progress_monitor parameter to the run method in the DTWF_simulator class is a good enhancement. It allows users to optionally enable progress tracking during simulations, which improves the interactivity and usability of the testing framework.
  • 188-189: Initializing the tqdm progress bar inside the run method when progress_monitor is provided is a practical approach. However, it's important to ensure that the progress_monitor parameter is documented in the method's docstring to inform users about its purpose and usage.

Consider adding documentation for the progress_monitor parameter in the method's docstring to improve code readability and user guidance.

  • 204-205: The call to progress_monitor.update(1) within the simulation loop correctly increments the progress bar for each generation simulated. This is a straightforward and effective way to provide feedback to the user on the simulation's progress.
  • 213-215: Adding the progress_monitor parameter to the run_more method is consistent with the changes made to the run method. It extends the progress monitoring feature to additional simulation runs, aligning with the PR's goal of enhancing user experience.
  • 229-230: Similar to the run method, initializing the tqdm progress bar in the run_more method when progress_monitor is provided is appropriate. Ensure that the usage of progress_monitor is documented in the method's docstring for clarity and consistency.

Please document the progress_monitor parameter in the run_more method's docstring to maintain consistency and provide clear guidance to users.

  • 234-235: The update to the progress bar in the run_more method is correctly implemented, mirroring the approach in the run method. This ensures that users receive consistent feedback across different simulation runs.
tests/test_gigutil.py (4)
  • 10-10: The addition of DUPLICATION_CHILD_FLAG constant is consistent with the existing pattern of defining flags for different types of child nodes. This is a good practice for maintaining readability and manageability of the code, especially when dealing with bitwise operations.
  • 172-172: The introduction of the duplication attribute as a tskit.Interval object is a clear and concise way to represent the interval of the genome affected by the duplication. This aligns well with the PR's objective to simulate and analyze the effects of tandem duplications on genome sizes.
  • 173-173: Adding a progress_monitor attribute to the test class and setting its default value to False is a straightforward approach to enable or disable progress monitoring during simulations. This flexibility is beneficial for users who may want to run simulations with or without progress tracking.
  • 181-185: The updates to method calls to include the progress_monitor parameter are consistent with the enhancements made to the testing framework. This ensures that progress monitoring can be enabled or disabled as per the user's choice during the simulation runs. It's important to verify that all relevant method calls have been updated to include this parameter.

Also applies to: 198-202, 205-210, 226-232

@hyanwong hyanwong force-pushed the tandem-duplications branch from 85951bb to 1a9cec0 Compare March 7, 2024 18:59
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between d726e99 and 1a9cec0.
Files selected for processing (4)
  • requirements.txt (1 hunks)
  • tests/gigutil.py (8 hunks)
  • tests/test_gigutil.py (9 hunks)
  • tests/test_graph.py (1 hunks)
Files skipped from review as they are similar to previous changes (2)
  • tests/gigutil.py
  • tests/test_graph.py
Additional comments: 4
requirements.txt (1)
  • 5-5: The addition of tqdm aligns with the PR's goal of introducing progress monitoring during simulations. Ensure that the version of tqdm added is compatible with the other packages and the overall project requirements.
tests/test_gigutil.py (3)
  • 10-10: The addition of DUPLICATION_CHILD_FLAG constant supports the new tandem duplication simulation feature. Ensure it's used consistently across the test suite.
  • 172-173: The addition of duplication and progress_monitor attributes in the test class enhances the simulation's analytical depth and user interaction. Verify that these attributes are utilized appropriately in the simulation methods.
  • 181-185: The updates to include the progress_monitor parameter in method calls align with the PR's objectives to enable progress tracking. Ensure that this parameter is correctly handled in each method to provide accurate progress updates.

Also applies to: 198-202, 205-210, 226-232, 262-262, 338-341, 490-494

Comment on lines 412 to 518
def test_tandem_duplication(self):
"""
Run a simulation in which a single tandem duplication is introduced then the
population explodes so that the duplication is not lost.

Unlike an inversion, we can't convert this to a tree sequence because
unequal recombination will keep generating new SVs during the lifetime
of the simulation.

Note that this routine can be called directly e.g.
from tests.test_gigutil import TestDTWF_one_break_no_rec_inversions_slow

test = TestDTWF_one_break_no_rec_inversions_slow()
test.test_tandem_duplication()
print(test.ts)
"""
self.simulator = sim.DTWF_one_break_no_rec_inversions_slow()
self.simulator.run(
num_diploids=2,
seq_len=self.seq_len,
gens=1,
random_seed=123,
further_node_flags=np.array(
[[DUPLICATION_CHILD_FLAG, 0], [0, 0]], dtype=np.int32
),
progress_monitor=self.progress_monitor,
)
# Insert a tandem duplication by editing the tables
tables = self.simulator.tables
times, inverses = np.unique(tables.nodes.time, return_inverse=True)
assert len(times) == 3
first_gen = np.where(inverses == np.where([times == 1])[0])[0]
second_gen = np.where(inverses == np.where([times == 0])[0])[0]
assert len(first_gen) == 4 # haploid genomes
assert len(second_gen) == 4 # haploid genomes
# Edit the existing iedges to create a duplication in one child
new_tables = tables.copy(omit_iedges=True)
dup_child_id = np.where(tables.nodes.flags == DUPLICATION_CHILD_FLAG)[0]
assert len(dup_child_id) == 1
dup_child_id = dup_child_id[0]
assert dup_child_id in second_gen
for ie in tables.iedges:
if ie.child == dup_child_id:
assert ie.child_right == ie.parent_right
if ie.child_left == 0:
assert ie.parent_left == 0
new_tables.add_iedge_row(
0,
self.duplication.right,
0,
self.duplication.right,
child=dup_child_id,
parent=ie.parent,
**self.simulator.add_iedge_params(),
)
new_tables.add_iedge_row(
self.duplication.right,
self.seq_len + self.duplication.span,
self.duplication.left,
self.seq_len,
child=dup_child_id,
parent=ie.parent,
**self.simulator.add_iedge_params(),
)
else:
# Don't add in the edge for the other parent. This duplication
# will not be associated with a recombination
pass
else:
new_tables.add_iedge_row(
**ie.asdict(), **self.simulator.add_iedge_params()
)
new_tables.sort()
self.simulator.tables = new_tables
# Check it gives a valid gig
gig = self.simulator.tables.copy().graph()

# Can progress the simulation
gig = self.simulator.run_more(
num_diploids=100,
gens=self.default_gens - 1,
random_seed=1,
progress_monitor=self.progress_monitor,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test_tandem_duplication method is a valuable addition for simulating tandem duplications. Consider enhancing the method's documentation to explain the simulation steps more clearly, especially when editing the tables to insert a tandem duplication. Additionally, include assertions or checks to verify that the simulation results align with the expected outcomes, such as genome sizes differing by multiples of the duplication size.

@hyanwong hyanwong force-pushed the tandem-duplications branch from 1a9cec0 to f4c2597 Compare March 15, 2024 09:32
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between a20aa6a and f4c2597.
Files selected for processing (4)
  • requirements.txt (1 hunks)
  • tests/gigutil.py (8 hunks)
  • tests/test_gigutil.py (10 hunks)
  • tests/test_graph.py (1 hunks)
Additional comments: 15
requirements.txt (1)
  • 5-5: The addition of tqdm to requirements.txt is correctly placed and aligns with the PR's objective to introduce progress monitoring during simulations. Good job maintaining the alphabetical order of the package list.
tests/test_graph.py (1)
  • 253-258: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1-399]

The tests in test_graph.py appear comprehensive, covering a wide range of scenarios and edge cases for the GeneticInheritanceGraphLibrary. While the specific change mentioned in the summary (removal of a print statement) is not directly visible, it's essential to maintain high test coverage and clarity in test cases to ensure the library's robustness.

Ensure that the removal of the print statement does not affect the readability or usefulness of the test outputs.

tests/gigutil.py (3)
  • 3-3: The import of tqdm from tqdm.auto is correctly done and aligns with the PR's objective to introduce progress monitoring during simulations.
  • 165-165: The introduction and usage of the progress_monitor parameter in the run function are correctly implemented. Utilizing tqdm for progress monitoring enhances user interaction and aligns with the PR's objectives.

Also applies to: 188-189, 204-205

  • 213-215: The introduction and usage of the progress_monitor parameter in the run_more function are correctly implemented and consistent with the run function. This enhances user interaction and aligns with the PR's objectives.

Also applies to: 229-230, 234-235

tests/test_gigutil.py (10)
  • 10-10: The addition of DUPLICATION_CHILD_FLAG constant is a good practice for maintaining readability and avoiding magic numbers in the code. It's important to ensure that this flag does not conflict with other existing flags.
  • 172-172: Adding a duplication attribute to the test class enhances the flexibility of the simulation by allowing the specification of a duplicated region. Ensure that the interval specified does not overlap with other critical regions unless intended for specific tests.
  • 173-173: The addition of a progress_monitor attribute is a useful feature for tracking the progress of simulations, especially for long-running processes. It's important to ensure that the progress monitoring does not significantly impact the performance of the simulations.
  • 176-176: Introducing an iedge_validate attribute to the test class is a good practice for enabling or disabling validation checks dynamically. This can be particularly useful for debugging or when running simulations under different conditions.
  • 180-186: The method test_plain_sim has been updated to include the progress_monitor parameter in the method call. This is a good practice for allowing dynamic control over progress monitoring. Ensure that all calls to this method throughout the codebase have been updated to include this parameter.
  • 197-211: Similarly, the method test_run_more has been updated to include the progress_monitor parameter. It's important to maintain consistency in how this parameter is handled across different methods to avoid confusion.
  • 225-233: The method test_vs_tskit_implementation has been updated to include the progress_monitor parameter. This consistency in handling the progress monitoring feature across tests is commendable.
  • 247-257: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [254-263]

The method test_inversion introduces an inversion simulation. It's crucial to ensure that the logic for inserting an inversion by editing the tables is correct and does not inadvertently affect other parts of the simulation. Additionally, consider adding more comments to explain the rationale behind specific choices, such as the selection of further_node_flags.

  • 284-288: The conditional block checking if the breakpoint is far enough to the right is a good safety check. However, it's important to ensure that the error message provided is clear and actionable for the user. Consider providing suggestions on how to choose a different seed or adjust other parameters to meet the condition.
  • 416-501: The test_tandem_duplication method introduces the simulation of tandem duplications, which is a key feature of this PR. The method is well-structured and includes progress monitoring. Ensure that the logic for inserting a tandem duplication by editing the tables is thoroughly tested, especially considering the complexity of genomic simulations. Additionally, consider adding more detailed comments to explain the steps involved in the simulation, particularly for the editing of tables to create a duplication.

@hyanwong hyanwong force-pushed the tandem-duplications branch from f4c2597 to cd554b5 Compare March 15, 2024 09:40
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between a20aa6a and cd554b5.
Files selected for processing (11)
  • GeneticInheritanceGraphLibrary/init.py (1 hunks)
  • GeneticInheritanceGraphLibrary/constants.py (1 hunks)
  • GeneticInheritanceGraphLibrary/graph.py (6 hunks)
  • GeneticInheritanceGraphLibrary/tables.py (16 hunks)
  • requirements.txt (1 hunks)
  • tests/conftest.py (1 hunks)
  • tests/gigutil.py (11 hunks)
  • tests/test_constants.py (1 hunks)
  • tests/test_gigutil.py (10 hunks)
  • tests/test_graph.py (4 hunks)
  • tests/test_tables.py (6 hunks)
Files skipped from review as they are similar to previous changes (2)
  • requirements.txt
  • tests/gigutil.py
Additional comments: 33
GeneticInheritanceGraphLibrary/__init__.py (2)
  • 9-9: The change from a wildcard import to a specific import of Const is a good practice as it improves clarity and avoids namespace pollution.
  • 11-12: The introduction of NULL and NODE_IS_SAMPLE constants is a good practice for code readability and maintainability. Ensure these constants are used consistently throughout the codebase.
Verification successful

The usage of NULL and NODE_IS_SAMPLE constants throughout the codebase has been verified. These constants are consistently used in both the library code and tests, aligning with good practices for code readability and maintainability.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify usage of NULL constant
rg --type py 'Const\.NULL'
# Verify usage of NODE_IS_SAMPLE constant
rg --type py 'Const\.NODE_IS_SAMPLE'

Length of output: 674

tests/test_constants.py (1)
  • 1-20: The tests for ValidFlags are well-structured and cover various aspects of its functionality. Consider adding tests for any untested flag combinations to ensure comprehensive coverage.
GeneticInheritanceGraphLibrary/constants.py (1)
  • 7-122: The addition of the Const and ValidFlags classes enhances code readability and maintainability by defining constants and flags used across the codebase. Ensure these constants and flags are used consistently and correctly throughout the codebase.
Verification successful

The usage of the Const and ValidFlags classes across various parts of the codebase, including tests and library modules, supports the initial assessment that these additions enhance code readability and maintainability. The evidence indicates that these classes are being used consistently and correctly throughout the codebase.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify usage of Const class constants
rg --type py 'Const\.[A-Z_]+'
# Verify usage of ValidFlags
rg --type py 'ValidFlags\.[A-Z_]+'

Length of output: 10293

tests/conftest.py (1)
  • 209-223: The adjustments in the trivial_gig function refine the test setup by modifying node data and edge definitions. Ensure these changes do not inadvertently affect the intended test outcomes.
GeneticInheritanceGraphLibrary/graph.py (2)
  • 13-14: Refactoring to use Const and ValidFlags from the constants module enhances clarity and consistency. Ensure these changes are applied consistently throughout the file.
Verification successful

The usage of Const and ValidFlags throughout the codebase, including in tests and main library files, demonstrates a consistent application of these constants. This aligns with the goal of enhancing clarity and consistency in the code. The refactoring appears to have been successfully applied across various parts of the project.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify consistent application of Const and ValidFlags
rg --type py 'Const\.[A-Z_]+'
rg --type py 'ValidFlags\.[A-Z_]+'

Length of output: 10293

* 108-108: The usage of `Const` values in method calls improves code clarity and consistency. Verify their correct application in all relevant method calls to ensure consistency across the codebase.

Also applies to: 150-150, 340-340, 441-441, 449-449

Verification successful

The script output demonstrates a consistent application of Const values in method calls across various parts of the codebase, including both tests and the main library files. This suggests that the usage of Const values for clarity and consistency, as mentioned in the review comment, has been correctly applied in the observed instances. Without specific counterexamples or a broader analysis of the entire codebase, it's reasonable to conclude that the review comment is verified based on the available information.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify correct application of Const values in method calls
rg --type py 'Const\.[A-Z_]+'

Length of output: 4008

tests/test_gigutil.py (5)
  • 10-10: Introduced a new constant DUPLICATION_CHILD_FLAG for use in tandem duplication simulations.

This addition enhances the clarity and maintainability of the code by using a descriptive constant instead of a hard-coded value.

  • 172-172: Added a duplication attribute to the test class to represent the interval affected by tandem duplication.

This is a good practice as it centralizes the definition of the duplication interval, making it easier to adjust for different tests if needed.

  • 173-173: Introduced a progress_monitor attribute to control the display of progress during simulations.

This addition improves user interaction by providing an option to monitor the progress of long-running simulations.

  • 176-176: Added an iedge_validate attribute to enable or disable validation of iedges during simulations.

This flexibility is beneficial for testing, allowing for performance optimizations by skipping validation when it's known to be unnecessary.

  • 416-507: Added a new method test_tandem_duplication to simulate and test the impact of tandem duplications on genome sizes.

This method is a significant addition to the testing framework, enabling the simulation of genomic variations due to tandem duplications. It's well-structured and includes detailed comments explaining the process and its implications.

tests/test_tables.py (5)
  • 137-144: The assertions in the TestExtractColumn class have been updated to check for specific values in the tables' nodes and edges. This is a critical part of ensuring the integrity of the data structure after operations such as additions or modifications. It's important to verify that these values align with the expected outcomes of the operations being tested, especially in the context of genomic simulations where accuracy is paramount.
  • 217-219: The use of ValidFlags.GIG to assert the flags before and after adding a row to the iedges table is a good practice for ensuring that the flags are correctly set and maintained. However, it's crucial to ensure that the ValidFlags.GIG constant accurately represents the expected state of the flags in all scenarios where it's used.
  • 225-231: The introduction of a params dictionary to pass additional parameters and validation flags to the add_iedge_row method enhances the flexibility and readability of the test. It's important to ensure that all possible combinations of flags and parameters are thoroughly tested to cover different scenarios and edge cases.
  • 342-360: The parameterized test test_flags in the TestIedgesValidation class is a robust way to validate the behavior of the add_iedge_row method under various flags and conditions. It's essential to ensure comprehensive coverage of all relevant flags and to verify that the assertions accurately reflect the expected outcomes.
  • 361-367: The test test_bad_flags correctly identifies scenarios where invalid flags involving the node table are used. This is crucial for ensuring the integrity of the edge table operations and preventing unintended modifications to the node table. It's important to verify that the error messages are clear and informative for debugging purposes.
GeneticInheritanceGraphLibrary/tables.py (16)
  • 12-13: Renaming constants to Const and ValidFlags improves readability and standardizes naming conventions across the project.
  • 16-17: Using aliases for tskit.NULL and tskit.NODE_IS_SAMPLE through Const enhances code clarity and maintainability.
  • 44-45: Setting default values for child_chromosome and parent_chromosome in IEdgeTableRow to 0 is a sensible choice, ensuring that these fields have a predictable initial state.
  • 91-91: Assigning Const.NULL as the default value for individual in NodeTableRow aligns with the use of constants for default values, promoting consistency.
  • 284-297: Initializing and copying logic in IEdgeTable with deep copying of _id_range_for_child ensures that internal data structures are correctly managed during copy operations. This is crucial for maintaining the integrity of data structures that track relationships between entities.
  • 304-307: The initialization of flags and _id_range_for_child in IEdgeTable.clear method ensures that the table is reset to a valid and consistent state. This is important for maintaining the integrity of the table's data.
  • 388-472: Adding the _validate_add_row method in IEdgeTable for enhanced validation checks is a positive step towards ensuring data integrity. It's important to ensure that all validation logic is thoroughly tested, especially for complex conditions involving chromosome IDs and interval overlaps.
  • 499-514: The _check_ints method in IEdgeTable improves validation by checking multiple values for integer constraints. This method enhances data integrity by ensuring that critical fields are integers. It's a good practice to ensure that this method is used consistently wherever applicable.
  • 531-538: Updating the ids_for_child method in IEdgeTable to handle chromosome IDs adds flexibility and specificity to querying by child and chromosome. This enhancement improves the utility of the method for more detailed analyses.
  • 545-565: Adding the max_child_pos method in IEdgeTable is a valuable addition for efficiently determining the maximum child position for a given child ID and chromosome. This method enhances the table's capabilities for analyzing genomic data.
  • 619-624: Introducing a property for accessing child chromosome IDs as a numpy array in IEdgeTable facilitates efficient data manipulation and analysis. This is a useful enhancement for working with genomic data.
  • 640-645: Similarly, adding a property for accessing parent chromosome IDs as a numpy array in IEdgeTable is beneficial for consistency and ease of data analysis.
  • 867-892: The _validate_add_iedge_row method in Tables class introduces additional validation checks related to node times. This method enhances the robustness of data integrity checks, especially for ensuring that parent nodes are older than child nodes. It's important to ensure comprehensive testing for these time-based validations.
  • 894-918: Enhancing the add_iedge_row method in Tables class to include node-specific validation checks is a significant improvement for ensuring data integrity. This method provides a more comprehensive approach to validation by considering both iedge and node properties.
  • 977-983: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [980-989]

The update to the sort method in Tables class to include sorting by child chromosome is a thoughtful addition that enhances the sorting logic. This change ensures that iedges are sorted in a more detailed and meaningful way, which is important for data consistency and analysis.

  • 994-1013: The documentation for the graph method in Tables class provides clear conditions for generating a genetic inheritance graph. This documentation is crucial for users to understand the requirements and limitations of converting tables into a GIG. It's important to ensure that the documentation is kept up-to-date with any changes to the validation logic or table structure.

Comment on lines 251 to 263
def _check_int(i, convert=False):
if isinstance(i, (int, np.integer)):
return i
try:
if i.is_integer():
if convert and i.is_integer():
return int(i)
raise ValueError(f"Expected {k + ' to be ' if k else ''}an integer not {i}")
raise ValueError(f"Expected an integer not {i}")
except AttributeError:
raise TypeError(
f"Could not convert {k + '=' if k else ''}{i} to an integer"
)
raise TypeError(f"Could not convert {i} to an integer")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactoring _check_int to optionally convert floating-point numbers to integers if they represent whole numbers is a useful enhancement. However, ensure that this behavior is clearly documented to avoid confusion.

Consider adding documentation to clarify the behavior of the convert parameter.

Comment on lines 327 to 441
def add_row(self, *args, validate=None, skip_validate=None, **kwargs) -> int:
"""
Add a row to an IEdgeTable: the arguments are passed to the RowClass constructor
(with the RowClass being dataclass).

:param bool validate_child_adjacency: If True, check the child ID is either
the same as the child_id of the last edge or is a completely new id.
If False, do not perform this check and assume the user has checked already
which should be more efficient but could lead to later algorithmic failures.
If None (default) do not check, and set flags to indicate the tables
may not be a valid GIG.
:param bool validate_intervals: If True, check the child left is less than
child right and the absolute parent span is the same as the child span.
If False, do not perform this check and assume the user has
checked already (which should be more efficient but could lead to later
algorithmic failures). If None (default) do not check, and set flags to
indicate the tables may not be a valid GIG.
:param int validate: A set of bitflags (attributes of the ``ValidFlags`` class)
specifying which iedge table validation checks
should be performed when adding this data. If the existing data is valid, and
the new data is added in a way that preserves the existing validity, then
calling ``iedges.has_bitflag`` for the flags in this set will return True.
If any of the bits in ``iedges_validation`` are ``0``, that particular
validation will not be performed: in this case the ``has_bitflag`` method
will return False for certain flags, and some table algorithms will not run.
For instance, using the ``ids_for_child()`` method is only valid if
``IEDGES_FOR_CHILD_ADJACENT`` and ``IEDGES_FOR_CHILD_PRIMARY_ORDER_CHR_ASC``
are set, so if you wish to use that method you should add those flags to
``validate``. Defaults to ``None`` which is treated as ``0``, meaning that
all ``IEDGE_...`` validation flags will be zeroed, no validation checks will
be performed, and hence the table will be marked as containing potentially
invalid iedge information.
:param bool skip_validate: If True, assume the user has checked that this
operation will pass the validation tests implied by the ``validate``
flags. This means that the validation routines will not be run, but the
tables may claim that they are valid when they are not. If False, and any of
the ``iedges_validation`` flags are set, perform the appropriate validations
(default: ``None`` treated as False)
:return: The row ID of the newly added row

Example:
new_id = iedges.add_row(cl, cr, pl, pr, c_id, p_id)
new_id = iedges.add_row(
cl, cr, pl, pr, child=c, parent=p, validate=ValidFlags.IEDGES_INTERVALS
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The enhanced add_row method in IEdgeTable introduces validation flags and internal data structure updates. This is a significant improvement for ensuring data integrity and providing flexibility in validation. However, it's crucial to document the behavior and impact of the validate and skip_validate parameters for clarity.

Consider adding comprehensive documentation for the validate and skip_validate parameters to explain their usage and effects.

@hyanwong hyanwong force-pushed the tandem-duplications branch 2 times, most recently from 253d9b5 to fa674da Compare March 15, 2024 11:23
@hyanwong hyanwong force-pushed the tandem-duplications branch from fa674da to 99db17e Compare March 26, 2024 22:38
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 0877a51 and 99db17e.
Files selected for processing (3)
  • requirements.txt (1 hunks)
  • tests/gigutil.py (8 hunks)
  • tests/test_gigutil.py (10 hunks)
Additional comments: 6
requirements.txt (1)
  • 5-5: Adding tqdm to the requirements is a good move for progress monitoring. However, consider specifying a version for tqdm to ensure compatibility and prevent potential future issues with updates.
tests/gigutil.py (3)
  • 3-3: The import of tqdm.auto is appropriate for compatibility with various environments. Good choice!
  • 169-169: Adding the progress_monitor parameter to enable progress monitoring during simulations is a valuable enhancement. This aligns well with the PR's objectives.

Also applies to: 217-218

  • 192-193: Utilizing tqdm for progress monitoring based on the progress_monitor parameter is implemented correctly. This enhances user interaction during simulations.

Also applies to: 208-209, 233-234, 238-239, 367-368, 379-380, 395-396, 400-401

tests/test_gigutil.py (2)
  • 10-10: The addition of constants and attributes such as DUPLICATION_CHILD_FLAG, duplication, progress_monitor, and iedge_validate supports the new tandem duplication feature and progress monitoring effectively.

Also applies to: 172-173, 176-176

  • 186-186: Including the progress_monitor parameter in the test_plain_sim, test_run_more, test_vs_tskit_implementation, and test_inversion methods aligns with the changes in gigutil.py and enhances the testing framework.

Also applies to: 210-210, 232-232, 342-342

Comment on lines 416 to 529
def test_tandem_duplication(self):
"""
Run a simulation in which a single tandem duplication is introduced then the
population explodes so that the duplication is not lost.

Unlike an inversion, we can't convert this to a tree sequence because
unequal recombination will keep generating new SVs during the lifetime
of the simulation.

Note that this routine can be called directly e.g.
from tests.test_gigutil import TestDTWF_one_break_no_rec_inversions_slow

test = TestDTWF_one_break_no_rec_inversions_slow()
test.progress_monitor = True # to show how the sim is progressing
test.default_gens = 40 # increase number of gens
test.test_tandem_duplication()
print(test.gig)
"""
self.simulator = sim.DTWF_one_break_no_rec_inversions_slow(self.iedge_validate)
self.simulator.run(
num_diploids=2,
seq_len=self.seq_len,
gens=1,
random_seed=1234, # chosen to give >= 3-fold tandem repeats after 5 gens
further_node_flags=np.array(
[[DUPLICATION_CHILD_FLAG, 0], [0, 0]], dtype=np.int32
),
progress_monitor=self.progress_monitor,
)
# Insert a tandem duplication by editing the tables
tables = self.simulator.tables
times, inverses = np.unique(tables.nodes.time, return_inverse=True)
assert len(times) == 3
first_gen = np.where(inverses == np.where([times == 1])[0])[0]
second_gen = np.where(inverses == np.where([times == 0])[0])[0]
assert len(first_gen) == 4 # haploid genomes
assert len(second_gen) == 4 # haploid genomes
# Edit the existing iedges to create a duplication in one child
new_tables = tables.copy(omit_iedges=True)
dup_child_id = np.where(tables.nodes.flags == DUPLICATION_CHILD_FLAG)[0]
assert len(dup_child_id) == 1
dup_child_id = dup_child_id[0]
assert dup_child_id in second_gen
for ie in tables.iedges:
if ie.child == dup_child_id:
assert ie.child_right == ie.parent_right
if ie.child_left == 0:
assert ie.parent_left == 0
new_tables.add_iedge_row(
0,
self.duplication.right,
0,
self.duplication.right,
child=dup_child_id,
parent=ie.parent,
**self.simulator.add_iedge_params(),
)
new_tables.add_iedge_row(
self.duplication.right,
self.seq_len + self.duplication.span,
self.duplication.left,
self.seq_len,
child=dup_child_id,
parent=ie.parent,
**self.simulator.add_iedge_params(),
)
else:
# Don't add in the edge for the other parent. This duplication
# will not be associated with a recombination
pass
else:
new_tables.add_iedge_row(
**ie._asdict(), **self.simulator.add_iedge_params()
)
new_tables.sort()
self.simulator.tables = new_tables
# Check it gives a valid gig
gig = self.simulator.tables.copy().graph()

# Can progress the simulation
gig = self.simulator.run_more(
num_diploids=100,
gens=self.default_gens - 1,
random_seed=1,
progress_monitor=self.progress_monitor,
)
# should have deleted the grand MRCA (used for matching)
assert len(gig.nodes) == len(self.simulator.tables.nodes) - 1

self.gig = gig
# Check that there are a number of duplicated regions:

lengths = [gig.max_position(u) for u in gig.samples]
unique_lengths = np.unique(lengths)
assert len(unique_lengths) > 2
assert np.all((np.diff(unique_lengths) % np.diff(self.duplication)) == 0)
# These should also form a tree that traces the duplicated regions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test_tandem_duplication method is a valuable addition for simulating tandem duplications. Consider enhancing the method's documentation to explain the simulation steps more clearly, especially when editing the tables to insert a tandem duplication. Additionally, include assertions or checks to verify that the simulation results align with the expected outcomes, such as genome sizes differing by multiples of the duplication size.

Would you like me to help with enhancing the documentation or adding the necessary assertions?

@hyanwong hyanwong force-pushed the tandem-duplications branch from 99db17e to 448a3a8 Compare March 27, 2024 15:39
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 0ff96a6 and 448a3a8.
Files selected for processing (3)
  • requirements.txt (1 hunks)
  • tests/gigutil.py (8 hunks)
  • tests/test_gigutil.py (10 hunks)
Files skipped from review as they are similar to previous changes (2)
  • requirements.txt
  • tests/gigutil.py
Additional comments: 9
tests/test_gigutil.py (9)
  • 10-10: The addition of the DUPLICATION_CHILD_FLAG constant is clear and follows the existing pattern established by INVERTED_CHILD_FLAG. This constant is presumably used to mark nodes in the simulation that result from tandem duplications.
  • 172-172: Adding a duplication attribute to the test class to specify the interval for tandem duplication is a good practice. It allows for easy adjustment of the duplication region for different tests.
  • 173-173: The addition of a progress_monitor attribute to the test class enables toggling progress monitoring during simulations. This is useful for visual feedback during long-running tests but can be disabled for faster execution or in non-interactive environments.
  • 176-176: Introducing an iedge_validate attribute to the test class allows for enabling or disabling validation of inheritance edges during simulations. This flexibility can be useful for testing different aspects of the simulation logic.
  • 180-186: The modification to include progress_monitor in the run method call of the test_plain_sim method integrates the new progress monitoring functionality into existing tests. Ensure that all tests that call run or run_more methods are updated accordingly to pass this parameter.
  • 197-211: Similar to the previous comment, the modification to include progress_monitor in the run_more method call of the test_run_more method correctly integrates the new progress monitoring functionality. Consistency in passing this parameter across all relevant tests is crucial.
  • 225-233: The inclusion of progress_monitor in the run method call within the test_vs_tskit_implementation method demonstrates consistent application of the new progress monitoring feature across the test suite.
  • 247-259: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [256-272]

The modification to include progress_monitor in the run method call of the test_inversion method is consistent with the integration of the new progress monitoring feature. This consistency is important for maintaining the usability of the test suite.

  • 425-529: The test_tandem_duplication method is a significant addition, simulating tandem duplications and verifying the outcomes. However, based on previous comments, enhancing the method's documentation could further clarify the simulation steps, especially regarding the editing of tables to insert a tandem duplication. Additionally, including assertions to verify that the simulation results align with expected outcomes, such as genome sizes differing by multiples of the duplication size, would strengthen the test's validity.

Given the previous feedback, consider enhancing the documentation and adding necessary assertions to ensure the method's clarity and effectiveness in simulating tandem duplications.

Should never have > 32767 chromosomes
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 448a3a8 and bcca655.
Files selected for processing (2)
  • GeneticInheritanceGraphLibrary/tables.py (7 hunks)
  • tests/test_tables.py (2 hunks)
Learnings (1)
tests/test_tables.py (1)
The `skip_validate` parameter's usage and implications have been clarified in the docstring of the `add_iedge_row` method in the `GeneticInheritanceGraphLibrary` project, as per the user's update.
Additional comments (7)
tests/test_tables.py (2)
  • 392-404: The test method test_add_iedge_allow_negative_edge_chrom introduces testing for adding edges with negative chromosome numbers, which could represent circular chromosomes. This is a valuable addition to ensure the library can handle such cases. However, it's important to ensure that the rest of the library and any downstream code can correctly interpret and handle these negative values without unintended consequences.
Verification successful

The shell script output indicates that the library does reference chromosome numbers, including in contexts that might involve negative values, as seen in the test case. However, the explicit handling of negative chromosome numbers, especially in the core library code (GeneticInheritanceGraphLibrary/tables.py), remains unclear. The mention of non-negative integers suggests a preference or assumption for non-negative values, but the implications of using negative values are not fully detailed.

To clarify the handling and implications of negative chromosome numbers in the library, further examination of the GeneticInheritanceGraphLibrary/tables.py file is necessary, focusing on the implementation details related to chromosome numbers.


The examination of tables.py confirms that the library is designed to handle chromosome numbers flexibly, including the use of negative values, as indicated by the test case and comments within the code. The storage of chromosome numbers as np.int16 and the explicit mention of negative chromosome numbers for circular chromosomes align with the test method test_add_iedge_allow_negative_edge_chrom. This suggests that the library and its tests are prepared to handle such cases, supporting the review comment's validity regarding the introduction of tests for negative chromosome numbers.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Search for handling of negative chromosome numbers in the library
rg --type py 'chromosome' | grep '-'

Length of output: 576



Script:

#!/bin/bash
# Examine the handling of chromosome numbers in tables.py, focusing on negative values
fd tables.py --exec cat {} | grep 'chromosome'

Length of output: 3948

* 427-438: The method `test_add_iedge_row_fail_negative` correctly tests for negative values in various parameters, ensuring that the library enforces non-negative constraints where appropriate. This is crucial for maintaining data integrity and preventing errors in genomic calculations. It's well-implemented and covers a comprehensive range of cases.
GeneticInheritanceGraphLibrary/tables.py (5)
  • 97-103: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [100-109]

The introduction of _non_int64_fieldtypes as a class attribute to specify field types other than int64 is a good approach for flexibility and optimization. However, it's important to ensure that the data types chosen are compatible with the operations performed on these fields throughout the codebase.

  • 298-301: Optimizing space by using np.int16 for child_chromosome and parent_chromosome in IEdgeTable is a sensible choice given the context. However, ensure that the range of np.int16 is sufficient for all possible values these fields might hold in the application's domain.
  • 715-715: Using np.float64 for the time field in NodeTable and np.uint32 for flags is appropriate given the nature of the data. This change enhances the precision and range of the time values and ensures that flags can accommodate a wide range of binary flags.
  • 765-765: The decision to use np.uint32 for flags in IndividualTable aligns with the need for a wide range of binary flags. This consistency in handling flags across different tables is good for maintainability.
  • 298-301: While optimizing space by using np.int16 for chromosome fields is a good practice, it's essential to ensure that the maximum value that can be stored in np.int16 is not exceeded in practical use cases. If there's a possibility of exceeding this range, consider using a larger data type.

Comment on lines +471 to +486
for i, val in enumerate(
(
edge,
child_left,
child_right,
parent_left,
parent_right,
child,
parent,
child_chromosome,
parent_chromosome,
child_left, # 0
child_right, # 1
parent_left, # 2
parent_right, # 3
child, # 4
parent, # 5
child_chromosome, # 6
parent_chromosome, # 7
edge, # 8
)
):
if int(i) != i:
if int(val) != val:
raise ValueError("Iedge data must be integers")
if is_edge != 0 and i < 0:
if i < 6 and val < 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation logic in add_row method of IEdgeTable checks if the provided values are integers and non-negative (except for the edge ID). This is crucial for data integrity. However, consider extracting this validation logic into a separate method to improve readability and maintainability.

-        if validate is None:
-            validate = ~ValidFlags.IEDGES_ALL
-        # only try validating if any IEDGES flags are set
-        if (not skip_validate) and bool(validate & ValidFlags.IEDGES_ALL):
-            # need to check the values before they were put into the data array,
-            # as numpy silently converts floats to integers on assignment
-            if validate & ValidFlags.IEDGES_INTEGERS:
-                for i, val in enumerate(
-                    (
-                        child_left,  # 0
-                        child_right,  # 1
-                        parent_left,  # 2
-                        parent_right,  # 3
-                        child,  # 4
-                        parent,  # 5
-                        child_chromosome,  # 6
-                        parent_chromosome,  # 7
-                        edge,  # 8
-                    )
-                ):
-                    if int(val) != val:
-                        raise ValueError("Iedge data must be integers")
-                    if i < 6 and val < 0:
-                        raise ValueError(
-                            "Iedge data must be non-negative (except edge ID)"
-                        )
+        self._validate_iedge_data(validate, skip_validate, child_left, child_right, parent_left, parent_right, child, parent, child_chromosome, parent_chromosome, edge)

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
for i, val in enumerate(
(
edge,
child_left,
child_right,
parent_left,
parent_right,
child,
parent,
child_chromosome,
parent_chromosome,
child_left, # 0
child_right, # 1
parent_left, # 2
parent_right, # 3
child, # 4
parent, # 5
child_chromosome, # 6
parent_chromosome, # 7
edge, # 8
)
):
if int(i) != i:
if int(val) != val:
raise ValueError("Iedge data must be integers")
if is_edge != 0 and i < 0:
if i < 6 and val < 0:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant