-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tandem duplications #100
base: main
Are you sure you want to change the base?
Tandem duplications #100
Conversation
WalkthroughThis update enhances a genetic simulation utility, focusing on progress monitoring during simulations and supporting structural variations (SVs) like tandem duplications. It improves user experience by integrating Changes
Related issues
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (3)
- tests/gigutil.py (8 hunks)
- tests/test_gigutil.py (9 hunks)
- tests/test_graph.py (1 hunks)
Additional comments: 13
tests/test_graph.py (2)
- 253-258: > 📝 NOTE
This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [1-1]
The removal of the
print(iedge_rows)
statement from thetest_all_svs_re_sample_resolve
method streamlines the output of the test case related toiedge_rows
verification. This change enhances the cleanliness of the test output, making it easier to identify relevant test results without the clutter of unnecessary print statements.
- 253-258: > 📝 NOTE
This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [1-1]
Overall, the test code in
tests/test_graph.py
is well-structured and covers a wide range of scenarios for testing theGeneticInheritanceGraphLibrary
. The classes and methods are logically organized, and the tests appear to be thorough, ensuring that various aspects of the library are correctly validated. It's important to continue maintaining this level of detail and coverage as the library evolves.tests/gigutil.py (7)
- 3-3: The import statement for
tqdm.auto
is correctly added to support progress monitoring during simulations. This is in line with the PR objectives to enhance the testing framework with progress tracking capabilities.- 165-165: The addition of the
progress_monitor
parameter to therun
method in theDTWF_simulator
class is a good enhancement. It allows users to optionally enable progress tracking during simulations, which improves the interactivity and usability of the testing framework.- 188-189: Initializing the
tqdm
progress bar inside therun
method whenprogress_monitor
is provided is a practical approach. However, it's important to ensure that theprogress_monitor
parameter is documented in the method's docstring to inform users about its purpose and usage.Consider adding documentation for the
progress_monitor
parameter in the method's docstring to improve code readability and user guidance.
- 204-205: The call to
progress_monitor.update(1)
within the simulation loop correctly increments the progress bar for each generation simulated. This is a straightforward and effective way to provide feedback to the user on the simulation's progress.- 213-215: Adding the
progress_monitor
parameter to therun_more
method is consistent with the changes made to therun
method. It extends the progress monitoring feature to additional simulation runs, aligning with the PR's goal of enhancing user experience.- 229-230: Similar to the
run
method, initializing thetqdm
progress bar in therun_more
method whenprogress_monitor
is provided is appropriate. Ensure that the usage ofprogress_monitor
is documented in the method's docstring for clarity and consistency.Please document the
progress_monitor
parameter in therun_more
method's docstring to maintain consistency and provide clear guidance to users.
- 234-235: The update to the progress bar in the
run_more
method is correctly implemented, mirroring the approach in therun
method. This ensures that users receive consistent feedback across different simulation runs.tests/test_gigutil.py (4)
- 10-10: The addition of
DUPLICATION_CHILD_FLAG
constant is consistent with the existing pattern of defining flags for different types of child nodes. This is a good practice for maintaining readability and manageability of the code, especially when dealing with bitwise operations.- 172-172: The introduction of the
duplication
attribute as atskit.Interval
object is a clear and concise way to represent the interval of the genome affected by the duplication. This aligns well with the PR's objective to simulate and analyze the effects of tandem duplications on genome sizes.- 173-173: Adding a
progress_monitor
attribute to the test class and setting its default value toFalse
is a straightforward approach to enable or disable progress monitoring during simulations. This flexibility is beneficial for users who may want to run simulations with or without progress tracking.- 181-185: The updates to method calls to include the
progress_monitor
parameter are consistent with the enhancements made to the testing framework. This ensures that progress monitoring can be enabled or disabled as per the user's choice during the simulation runs. It's important to verify that all relevant method calls have been updated to include this parameter.Also applies to: 198-202, 205-210, 226-232
85951bb
to
1a9cec0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (4)
- requirements.txt (1 hunks)
- tests/gigutil.py (8 hunks)
- tests/test_gigutil.py (9 hunks)
- tests/test_graph.py (1 hunks)
Files skipped from review as they are similar to previous changes (2)
- tests/gigutil.py
- tests/test_graph.py
Additional comments: 4
requirements.txt (1)
- 5-5: The addition of
tqdm
aligns with the PR's goal of introducing progress monitoring during simulations. Ensure that the version oftqdm
added is compatible with the other packages and the overall project requirements.tests/test_gigutil.py (3)
- 10-10: The addition of
DUPLICATION_CHILD_FLAG
constant supports the new tandem duplication simulation feature. Ensure it's used consistently across the test suite.- 172-173: The addition of
duplication
andprogress_monitor
attributes in the test class enhances the simulation's analytical depth and user interaction. Verify that these attributes are utilized appropriately in the simulation methods.- 181-185: The updates to include the
progress_monitor
parameter in method calls align with the PR's objectives to enable progress tracking. Ensure that this parameter is correctly handled in each method to provide accurate progress updates.Also applies to: 198-202, 205-210, 226-232, 262-262, 338-341, 490-494
tests/test_gigutil.py
Outdated
def test_tandem_duplication(self): | ||
""" | ||
Run a simulation in which a single tandem duplication is introduced then the | ||
population explodes so that the duplication is not lost. | ||
|
||
Unlike an inversion, we can't convert this to a tree sequence because | ||
unequal recombination will keep generating new SVs during the lifetime | ||
of the simulation. | ||
|
||
Note that this routine can be called directly e.g. | ||
from tests.test_gigutil import TestDTWF_one_break_no_rec_inversions_slow | ||
|
||
test = TestDTWF_one_break_no_rec_inversions_slow() | ||
test.test_tandem_duplication() | ||
print(test.ts) | ||
""" | ||
self.simulator = sim.DTWF_one_break_no_rec_inversions_slow() | ||
self.simulator.run( | ||
num_diploids=2, | ||
seq_len=self.seq_len, | ||
gens=1, | ||
random_seed=123, | ||
further_node_flags=np.array( | ||
[[DUPLICATION_CHILD_FLAG, 0], [0, 0]], dtype=np.int32 | ||
), | ||
progress_monitor=self.progress_monitor, | ||
) | ||
# Insert a tandem duplication by editing the tables | ||
tables = self.simulator.tables | ||
times, inverses = np.unique(tables.nodes.time, return_inverse=True) | ||
assert len(times) == 3 | ||
first_gen = np.where(inverses == np.where([times == 1])[0])[0] | ||
second_gen = np.where(inverses == np.where([times == 0])[0])[0] | ||
assert len(first_gen) == 4 # haploid genomes | ||
assert len(second_gen) == 4 # haploid genomes | ||
# Edit the existing iedges to create a duplication in one child | ||
new_tables = tables.copy(omit_iedges=True) | ||
dup_child_id = np.where(tables.nodes.flags == DUPLICATION_CHILD_FLAG)[0] | ||
assert len(dup_child_id) == 1 | ||
dup_child_id = dup_child_id[0] | ||
assert dup_child_id in second_gen | ||
for ie in tables.iedges: | ||
if ie.child == dup_child_id: | ||
assert ie.child_right == ie.parent_right | ||
if ie.child_left == 0: | ||
assert ie.parent_left == 0 | ||
new_tables.add_iedge_row( | ||
0, | ||
self.duplication.right, | ||
0, | ||
self.duplication.right, | ||
child=dup_child_id, | ||
parent=ie.parent, | ||
**self.simulator.add_iedge_params(), | ||
) | ||
new_tables.add_iedge_row( | ||
self.duplication.right, | ||
self.seq_len + self.duplication.span, | ||
self.duplication.left, | ||
self.seq_len, | ||
child=dup_child_id, | ||
parent=ie.parent, | ||
**self.simulator.add_iedge_params(), | ||
) | ||
else: | ||
# Don't add in the edge for the other parent. This duplication | ||
# will not be associated with a recombination | ||
pass | ||
else: | ||
new_tables.add_iedge_row( | ||
**ie.asdict(), **self.simulator.add_iedge_params() | ||
) | ||
new_tables.sort() | ||
self.simulator.tables = new_tables | ||
# Check it gives a valid gig | ||
gig = self.simulator.tables.copy().graph() | ||
|
||
# Can progress the simulation | ||
gig = self.simulator.run_more( | ||
num_diploids=100, | ||
gens=self.default_gens - 1, | ||
random_seed=1, | ||
progress_monitor=self.progress_monitor, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test_tandem_duplication
method is a valuable addition for simulating tandem duplications. Consider enhancing the method's documentation to explain the simulation steps more clearly, especially when editing the tables to insert a tandem duplication. Additionally, include assertions or checks to verify that the simulation results align with the expected outcomes, such as genome sizes differing by multiples of the duplication size.
1a9cec0
to
f4c2597
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (4)
- requirements.txt (1 hunks)
- tests/gigutil.py (8 hunks)
- tests/test_gigutil.py (10 hunks)
- tests/test_graph.py (1 hunks)
Additional comments: 15
requirements.txt (1)
- 5-5: The addition of
tqdm
torequirements.txt
is correctly placed and aligns with the PR's objective to introduce progress monitoring during simulations. Good job maintaining the alphabetical order of the package list.tests/test_graph.py (1)
- 253-258: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1-399]
The tests in
test_graph.py
appear comprehensive, covering a wide range of scenarios and edge cases for theGeneticInheritanceGraphLibrary
. While the specific change mentioned in the summary (removal of aEnsure that the removal of the
tests/gigutil.py (3)
- 3-3: The import of
tqdm
fromtqdm.auto
is correctly done and aligns with the PR's objective to introduce progress monitoring during simulations.- 165-165: The introduction and usage of the
progress_monitor
parameter in therun
function are correctly implemented. Utilizingtqdm
for progress monitoring enhances user interaction and aligns with the PR's objectives.Also applies to: 188-189, 204-205
- 213-215: The introduction and usage of the
progress_monitor
parameter in therun_more
function are correctly implemented and consistent with therun
function. This enhances user interaction and aligns with the PR's objectives.Also applies to: 229-230, 234-235
tests/test_gigutil.py (10)
- 10-10: The addition of
DUPLICATION_CHILD_FLAG
constant is a good practice for maintaining readability and avoiding magic numbers in the code. It's important to ensure that this flag does not conflict with other existing flags.- 172-172: Adding a
duplication
attribute to the test class enhances the flexibility of the simulation by allowing the specification of a duplicated region. Ensure that the interval specified does not overlap with other critical regions unless intended for specific tests.- 173-173: The addition of a
progress_monitor
attribute is a useful feature for tracking the progress of simulations, especially for long-running processes. It's important to ensure that the progress monitoring does not significantly impact the performance of the simulations.- 176-176: Introducing an
iedge_validate
attribute to the test class is a good practice for enabling or disabling validation checks dynamically. This can be particularly useful for debugging or when running simulations under different conditions.- 180-186: The method
test_plain_sim
has been updated to include theprogress_monitor
parameter in the method call. This is a good practice for allowing dynamic control over progress monitoring. Ensure that all calls to this method throughout the codebase have been updated to include this parameter.- 197-211: Similarly, the method
test_run_more
has been updated to include theprogress_monitor
parameter. It's important to maintain consistency in how this parameter is handled across different methods to avoid confusion.- 225-233: The method
test_vs_tskit_implementation
has been updated to include theprogress_monitor
parameter. This consistency in handling the progress monitoring feature across tests is commendable.- 247-257: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [254-263]
The method
test_inversion
introduces an inversion simulation. It's crucial to ensure that the logic for inserting an inversion by editing the tables is correct and does not inadvertently affect other parts of the simulation. Additionally, consider adding more comments to explain the rationale behind specific choices, such as the selection offurther_node_flags
.
- 284-288: The conditional block checking if the breakpoint is far enough to the right is a good safety check. However, it's important to ensure that the error message provided is clear and actionable for the user. Consider providing suggestions on how to choose a different seed or adjust other parameters to meet the condition.
- 416-501: The
test_tandem_duplication
method introduces the simulation of tandem duplications, which is a key feature of this PR. The method is well-structured and includes progress monitoring. Ensure that the logic for inserting a tandem duplication by editing the tables is thoroughly tested, especially considering the complexity of genomic simulations. Additionally, consider adding more detailed comments to explain the steps involved in the simulation, particularly for the editing of tables to create a duplication.
f4c2597
to
cd554b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 2
Configuration used: CodeRabbit UI
Files selected for processing (11)
- GeneticInheritanceGraphLibrary/init.py (1 hunks)
- GeneticInheritanceGraphLibrary/constants.py (1 hunks)
- GeneticInheritanceGraphLibrary/graph.py (6 hunks)
- GeneticInheritanceGraphLibrary/tables.py (16 hunks)
- requirements.txt (1 hunks)
- tests/conftest.py (1 hunks)
- tests/gigutil.py (11 hunks)
- tests/test_constants.py (1 hunks)
- tests/test_gigutil.py (10 hunks)
- tests/test_graph.py (4 hunks)
- tests/test_tables.py (6 hunks)
Files skipped from review as they are similar to previous changes (2)
- requirements.txt
- tests/gigutil.py
Additional comments: 33
GeneticInheritanceGraphLibrary/__init__.py (2)
- 9-9: The change from a wildcard import to a specific import of
Const
is a good practice as it improves clarity and avoids namespace pollution.- 11-12: The introduction of
NULL
andNODE_IS_SAMPLE
constants is a good practice for code readability and maintainability. Ensure these constants are used consistently throughout the codebase.Verification successful
The usage of
NULL
andNODE_IS_SAMPLE
constants throughout the codebase has been verified. These constants are consistently used in both the library code and tests, aligning with good practices for code readability and maintainability.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify usage of NULL constant rg --type py 'Const\.NULL' # Verify usage of NODE_IS_SAMPLE constant rg --type py 'Const\.NODE_IS_SAMPLE'Length of output: 674
tests/test_constants.py (1)
- 1-20: The tests for
ValidFlags
are well-structured and cover various aspects of its functionality. Consider adding tests for any untested flag combinations to ensure comprehensive coverage.GeneticInheritanceGraphLibrary/constants.py (1)
- 7-122: The addition of the
Const
andValidFlags
classes enhances code readability and maintainability by defining constants and flags used across the codebase. Ensure these constants and flags are used consistently and correctly throughout the codebase.Verification successful
The usage of the
Const
andValidFlags
classes across various parts of the codebase, including tests and library modules, supports the initial assessment that these additions enhance code readability and maintainability. The evidence indicates that these classes are being used consistently and correctly throughout the codebase.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify usage of Const class constants rg --type py 'Const\.[A-Z_]+' # Verify usage of ValidFlags rg --type py 'ValidFlags\.[A-Z_]+'Length of output: 10293
tests/conftest.py (1)
- 209-223: The adjustments in the
trivial_gig
function refine the test setup by modifying node data and edge definitions. Ensure these changes do not inadvertently affect the intended test outcomes.GeneticInheritanceGraphLibrary/graph.py (2)
- 13-14: Refactoring to use
Const
andValidFlags
from theconstants
module enhances clarity and consistency. Ensure these changes are applied consistently throughout the file.Verification successful
The usage of
Const
andValidFlags
throughout the codebase, including in tests and main library files, demonstrates a consistent application of these constants. This aligns with the goal of enhancing clarity and consistency in the code. The refactoring appears to have been successfully applied across various parts of the project.* 108-108: The usage of `Const` values in method calls improves code clarity and consistency. Verify their correct application in all relevant method calls to ensure consistency across the codebase.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify consistent application of Const and ValidFlags rg --type py 'Const\.[A-Z_]+' rg --type py 'ValidFlags\.[A-Z_]+'Length of output: 10293
Also applies to: 150-150, 340-340, 441-441, 449-449
Verification successful
The script output demonstrates a consistent application of
Const
values in method calls across various parts of the codebase, including both tests and the main library files. This suggests that the usage ofConst
values for clarity and consistency, as mentioned in the review comment, has been correctly applied in the observed instances. Without specific counterexamples or a broader analysis of the entire codebase, it's reasonable to conclude that the review comment is verified based on the available information.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify correct application of Const values in method calls rg --type py 'Const\.[A-Z_]+'Length of output: 4008
tests/test_gigutil.py (5)
- 10-10: Introduced a new constant
DUPLICATION_CHILD_FLAG
for use in tandem duplication simulations.This addition enhances the clarity and maintainability of the code by using a descriptive constant instead of a hard-coded value.
- 172-172: Added a
duplication
attribute to the test class to represent the interval affected by tandem duplication.This is a good practice as it centralizes the definition of the duplication interval, making it easier to adjust for different tests if needed.
- 173-173: Introduced a
progress_monitor
attribute to control the display of progress during simulations.This addition improves user interaction by providing an option to monitor the progress of long-running simulations.
- 176-176: Added an
iedge_validate
attribute to enable or disable validation of iedges during simulations.This flexibility is beneficial for testing, allowing for performance optimizations by skipping validation when it's known to be unnecessary.
- 416-507: Added a new method
test_tandem_duplication
to simulate and test the impact of tandem duplications on genome sizes.This method is a significant addition to the testing framework, enabling the simulation of genomic variations due to tandem duplications. It's well-structured and includes detailed comments explaining the process and its implications.
tests/test_tables.py (5)
- 137-144: The assertions in the
TestExtractColumn
class have been updated to check for specific values in the tables' nodes and edges. This is a critical part of ensuring the integrity of the data structure after operations such as additions or modifications. It's important to verify that these values align with the expected outcomes of the operations being tested, especially in the context of genomic simulations where accuracy is paramount.- 217-219: The use of
ValidFlags.GIG
to assert the flags before and after adding a row to theiedges
table is a good practice for ensuring that the flags are correctly set and maintained. However, it's crucial to ensure that theValidFlags.GIG
constant accurately represents the expected state of the flags in all scenarios where it's used.- 225-231: The introduction of a
params
dictionary to pass additional parameters and validation flags to theadd_iedge_row
method enhances the flexibility and readability of the test. It's important to ensure that all possible combinations of flags and parameters are thoroughly tested to cover different scenarios and edge cases.- 342-360: The parameterized test
test_flags
in theTestIedgesValidation
class is a robust way to validate the behavior of theadd_iedge_row
method under various flags and conditions. It's essential to ensure comprehensive coverage of all relevant flags and to verify that the assertions accurately reflect the expected outcomes.- 361-367: The test
test_bad_flags
correctly identifies scenarios where invalid flags involving the node table are used. This is crucial for ensuring the integrity of the edge table operations and preventing unintended modifications to the node table. It's important to verify that the error messages are clear and informative for debugging purposes.GeneticInheritanceGraphLibrary/tables.py (16)
- 12-13: Renaming constants to
Const
andValidFlags
improves readability and standardizes naming conventions across the project.- 16-17: Using aliases for
tskit.NULL
andtskit.NODE_IS_SAMPLE
throughConst
enhances code clarity and maintainability.- 44-45: Setting default values for
child_chromosome
andparent_chromosome
inIEdgeTableRow
to0
is a sensible choice, ensuring that these fields have a predictable initial state.- 91-91: Assigning
Const.NULL
as the default value forindividual
inNodeTableRow
aligns with the use of constants for default values, promoting consistency.- 284-297: Initializing and copying logic in
IEdgeTable
with deep copying of_id_range_for_child
ensures that internal data structures are correctly managed during copy operations. This is crucial for maintaining the integrity of data structures that track relationships between entities.- 304-307: The initialization of
flags
and_id_range_for_child
inIEdgeTable.clear
method ensures that the table is reset to a valid and consistent state. This is important for maintaining the integrity of the table's data.- 388-472: Adding the
_validate_add_row
method inIEdgeTable
for enhanced validation checks is a positive step towards ensuring data integrity. It's important to ensure that all validation logic is thoroughly tested, especially for complex conditions involving chromosome IDs and interval overlaps.- 499-514: The
_check_ints
method inIEdgeTable
improves validation by checking multiple values for integer constraints. This method enhances data integrity by ensuring that critical fields are integers. It's a good practice to ensure that this method is used consistently wherever applicable.- 531-538: Updating the
ids_for_child
method inIEdgeTable
to handle chromosome IDs adds flexibility and specificity to querying by child and chromosome. This enhancement improves the utility of the method for more detailed analyses.- 545-565: Adding the
max_child_pos
method inIEdgeTable
is a valuable addition for efficiently determining the maximum child position for a given child ID and chromosome. This method enhances the table's capabilities for analyzing genomic data.- 619-624: Introducing a property for accessing child chromosome IDs as a numpy array in
IEdgeTable
facilitates efficient data manipulation and analysis. This is a useful enhancement for working with genomic data.- 640-645: Similarly, adding a property for accessing parent chromosome IDs as a numpy array in
IEdgeTable
is beneficial for consistency and ease of data analysis.- 867-892: The
_validate_add_iedge_row
method inTables
class introduces additional validation checks related to node times. This method enhances the robustness of data integrity checks, especially for ensuring that parent nodes are older than child nodes. It's important to ensure comprehensive testing for these time-based validations.- 894-918: Enhancing the
add_iedge_row
method inTables
class to include node-specific validation checks is a significant improvement for ensuring data integrity. This method provides a more comprehensive approach to validation by considering both iedge and node properties.- 977-983: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [980-989]
The update to the
sort
method inTables
class to include sorting by child chromosome is a thoughtful addition that enhances the sorting logic. This change ensures that iedges are sorted in a more detailed and meaningful way, which is important for data consistency and analysis.
- 994-1013: The documentation for the
graph
method inTables
class provides clear conditions for generating a genetic inheritance graph. This documentation is crucial for users to understand the requirements and limitations of converting tables into a GIG. It's important to ensure that the documentation is kept up-to-date with any changes to the validation logic or table structure.
def _check_int(i, convert=False): | ||
if isinstance(i, (int, np.integer)): | ||
return i | ||
try: | ||
if i.is_integer(): | ||
if convert and i.is_integer(): | ||
return int(i) | ||
raise ValueError(f"Expected {k + ' to be ' if k else ''}an integer not {i}") | ||
raise ValueError(f"Expected an integer not {i}") | ||
except AttributeError: | ||
raise TypeError( | ||
f"Could not convert {k + '=' if k else ''}{i} to an integer" | ||
) | ||
raise TypeError(f"Could not convert {i} to an integer") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactoring _check_int
to optionally convert floating-point numbers to integers if they represent whole numbers is a useful enhancement. However, ensure that this behavior is clearly documented to avoid confusion.
Consider adding documentation to clarify the behavior of the convert
parameter.
def add_row(self, *args, validate=None, skip_validate=None, **kwargs) -> int: | ||
""" | ||
Add a row to an IEdgeTable: the arguments are passed to the RowClass constructor | ||
(with the RowClass being dataclass). | ||
|
||
:param bool validate_child_adjacency: If True, check the child ID is either | ||
the same as the child_id of the last edge or is a completely new id. | ||
If False, do not perform this check and assume the user has checked already | ||
which should be more efficient but could lead to later algorithmic failures. | ||
If None (default) do not check, and set flags to indicate the tables | ||
may not be a valid GIG. | ||
:param bool validate_intervals: If True, check the child left is less than | ||
child right and the absolute parent span is the same as the child span. | ||
If False, do not perform this check and assume the user has | ||
checked already (which should be more efficient but could lead to later | ||
algorithmic failures). If None (default) do not check, and set flags to | ||
indicate the tables may not be a valid GIG. | ||
:param int validate: A set of bitflags (attributes of the ``ValidFlags`` class) | ||
specifying which iedge table validation checks | ||
should be performed when adding this data. If the existing data is valid, and | ||
the new data is added in a way that preserves the existing validity, then | ||
calling ``iedges.has_bitflag`` for the flags in this set will return True. | ||
If any of the bits in ``iedges_validation`` are ``0``, that particular | ||
validation will not be performed: in this case the ``has_bitflag`` method | ||
will return False for certain flags, and some table algorithms will not run. | ||
For instance, using the ``ids_for_child()`` method is only valid if | ||
``IEDGES_FOR_CHILD_ADJACENT`` and ``IEDGES_FOR_CHILD_PRIMARY_ORDER_CHR_ASC`` | ||
are set, so if you wish to use that method you should add those flags to | ||
``validate``. Defaults to ``None`` which is treated as ``0``, meaning that | ||
all ``IEDGE_...`` validation flags will be zeroed, no validation checks will | ||
be performed, and hence the table will be marked as containing potentially | ||
invalid iedge information. | ||
:param bool skip_validate: If True, assume the user has checked that this | ||
operation will pass the validation tests implied by the ``validate`` | ||
flags. This means that the validation routines will not be run, but the | ||
tables may claim that they are valid when they are not. If False, and any of | ||
the ``iedges_validation`` flags are set, perform the appropriate validations | ||
(default: ``None`` treated as False) | ||
:return: The row ID of the newly added row | ||
|
||
Example: | ||
new_id = iedges.add_row(cl, cr, pl, pr, c_id, p_id) | ||
new_id = iedges.add_row( | ||
cl, cr, pl, pr, child=c, parent=p, validate=ValidFlags.IEDGES_INTERVALS | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The enhanced add_row
method in IEdgeTable
introduces validation flags and internal data structure updates. This is a significant improvement for ensuring data integrity and providing flexibility in validation. However, it's crucial to document the behavior and impact of the validate
and skip_validate
parameters for clarity.
Consider adding comprehensive documentation for the validate
and skip_validate
parameters to explain their usage and effects.
253d9b5
to
fa674da
Compare
fa674da
to
99db17e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (3)
- requirements.txt (1 hunks)
- tests/gigutil.py (8 hunks)
- tests/test_gigutil.py (10 hunks)
Additional comments: 6
requirements.txt (1)
- 5-5: Adding
tqdm
to the requirements is a good move for progress monitoring. However, consider specifying a version fortqdm
to ensure compatibility and prevent potential future issues with updates.tests/gigutil.py (3)
- 3-3: The import of
tqdm.auto
is appropriate for compatibility with various environments. Good choice!- 169-169: Adding the
progress_monitor
parameter to enable progress monitoring during simulations is a valuable enhancement. This aligns well with the PR's objectives.Also applies to: 217-218
- 192-193: Utilizing
tqdm
for progress monitoring based on theprogress_monitor
parameter is implemented correctly. This enhances user interaction during simulations.Also applies to: 208-209, 233-234, 238-239, 367-368, 379-380, 395-396, 400-401
tests/test_gigutil.py (2)
- 10-10: The addition of constants and attributes such as
DUPLICATION_CHILD_FLAG
,duplication
,progress_monitor
, andiedge_validate
supports the new tandem duplication feature and progress monitoring effectively.Also applies to: 172-173, 176-176
- 186-186: Including the
progress_monitor
parameter in thetest_plain_sim
,test_run_more
,test_vs_tskit_implementation
, andtest_inversion
methods aligns with the changes ingigutil.py
and enhances the testing framework.Also applies to: 210-210, 232-232, 342-342
tests/test_gigutil.py
Outdated
def test_tandem_duplication(self): | ||
""" | ||
Run a simulation in which a single tandem duplication is introduced then the | ||
population explodes so that the duplication is not lost. | ||
|
||
Unlike an inversion, we can't convert this to a tree sequence because | ||
unequal recombination will keep generating new SVs during the lifetime | ||
of the simulation. | ||
|
||
Note that this routine can be called directly e.g. | ||
from tests.test_gigutil import TestDTWF_one_break_no_rec_inversions_slow | ||
|
||
test = TestDTWF_one_break_no_rec_inversions_slow() | ||
test.progress_monitor = True # to show how the sim is progressing | ||
test.default_gens = 40 # increase number of gens | ||
test.test_tandem_duplication() | ||
print(test.gig) | ||
""" | ||
self.simulator = sim.DTWF_one_break_no_rec_inversions_slow(self.iedge_validate) | ||
self.simulator.run( | ||
num_diploids=2, | ||
seq_len=self.seq_len, | ||
gens=1, | ||
random_seed=1234, # chosen to give >= 3-fold tandem repeats after 5 gens | ||
further_node_flags=np.array( | ||
[[DUPLICATION_CHILD_FLAG, 0], [0, 0]], dtype=np.int32 | ||
), | ||
progress_monitor=self.progress_monitor, | ||
) | ||
# Insert a tandem duplication by editing the tables | ||
tables = self.simulator.tables | ||
times, inverses = np.unique(tables.nodes.time, return_inverse=True) | ||
assert len(times) == 3 | ||
first_gen = np.where(inverses == np.where([times == 1])[0])[0] | ||
second_gen = np.where(inverses == np.where([times == 0])[0])[0] | ||
assert len(first_gen) == 4 # haploid genomes | ||
assert len(second_gen) == 4 # haploid genomes | ||
# Edit the existing iedges to create a duplication in one child | ||
new_tables = tables.copy(omit_iedges=True) | ||
dup_child_id = np.where(tables.nodes.flags == DUPLICATION_CHILD_FLAG)[0] | ||
assert len(dup_child_id) == 1 | ||
dup_child_id = dup_child_id[0] | ||
assert dup_child_id in second_gen | ||
for ie in tables.iedges: | ||
if ie.child == dup_child_id: | ||
assert ie.child_right == ie.parent_right | ||
if ie.child_left == 0: | ||
assert ie.parent_left == 0 | ||
new_tables.add_iedge_row( | ||
0, | ||
self.duplication.right, | ||
0, | ||
self.duplication.right, | ||
child=dup_child_id, | ||
parent=ie.parent, | ||
**self.simulator.add_iedge_params(), | ||
) | ||
new_tables.add_iedge_row( | ||
self.duplication.right, | ||
self.seq_len + self.duplication.span, | ||
self.duplication.left, | ||
self.seq_len, | ||
child=dup_child_id, | ||
parent=ie.parent, | ||
**self.simulator.add_iedge_params(), | ||
) | ||
else: | ||
# Don't add in the edge for the other parent. This duplication | ||
# will not be associated with a recombination | ||
pass | ||
else: | ||
new_tables.add_iedge_row( | ||
**ie._asdict(), **self.simulator.add_iedge_params() | ||
) | ||
new_tables.sort() | ||
self.simulator.tables = new_tables | ||
# Check it gives a valid gig | ||
gig = self.simulator.tables.copy().graph() | ||
|
||
# Can progress the simulation | ||
gig = self.simulator.run_more( | ||
num_diploids=100, | ||
gens=self.default_gens - 1, | ||
random_seed=1, | ||
progress_monitor=self.progress_monitor, | ||
) | ||
# should have deleted the grand MRCA (used for matching) | ||
assert len(gig.nodes) == len(self.simulator.tables.nodes) - 1 | ||
|
||
self.gig = gig | ||
# Check that there are a number of duplicated regions: | ||
|
||
lengths = [gig.max_position(u) for u in gig.samples] | ||
unique_lengths = np.unique(lengths) | ||
assert len(unique_lengths) > 2 | ||
assert np.all((np.diff(unique_lengths) % np.diff(self.duplication)) == 0) | ||
# These should also form a tree that traces the duplicated regions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test_tandem_duplication
method is a valuable addition for simulating tandem duplications. Consider enhancing the method's documentation to explain the simulation steps more clearly, especially when editing the tables to insert a tandem duplication. Additionally, include assertions or checks to verify that the simulation results align with the expected outcomes, such as genome sizes differing by multiples of the duplication size.
Would you like me to help with enhancing the documentation or adding the necessary assertions?
And add a progress monitor for independent testing
99db17e
to
448a3a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (3)
- requirements.txt (1 hunks)
- tests/gigutil.py (8 hunks)
- tests/test_gigutil.py (10 hunks)
Files skipped from review as they are similar to previous changes (2)
- requirements.txt
- tests/gigutil.py
Additional comments: 9
tests/test_gigutil.py (9)
- 10-10: The addition of the
DUPLICATION_CHILD_FLAG
constant is clear and follows the existing pattern established byINVERTED_CHILD_FLAG
. This constant is presumably used to mark nodes in the simulation that result from tandem duplications.- 172-172: Adding a
duplication
attribute to the test class to specify the interval for tandem duplication is a good practice. It allows for easy adjustment of the duplication region for different tests.- 173-173: The addition of a
progress_monitor
attribute to the test class enables toggling progress monitoring during simulations. This is useful for visual feedback during long-running tests but can be disabled for faster execution or in non-interactive environments.- 176-176: Introducing an
iedge_validate
attribute to the test class allows for enabling or disabling validation of inheritance edges during simulations. This flexibility can be useful for testing different aspects of the simulation logic.- 180-186: The modification to include
progress_monitor
in therun
method call of thetest_plain_sim
method integrates the new progress monitoring functionality into existing tests. Ensure that all tests that callrun
orrun_more
methods are updated accordingly to pass this parameter.- 197-211: Similar to the previous comment, the modification to include
progress_monitor
in therun_more
method call of thetest_run_more
method correctly integrates the new progress monitoring functionality. Consistency in passing this parameter across all relevant tests is crucial.- 225-233: The inclusion of
progress_monitor
in therun
method call within thetest_vs_tskit_implementation
method demonstrates consistent application of the new progress monitoring feature across the test suite.- 247-259: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [256-272]
The modification to include
progress_monitor
in therun
method call of thetest_inversion
method is consistent with the integration of the new progress monitoring feature. This consistency is important for maintaining the usability of the test suite.
- 425-529: The
test_tandem_duplication
method is a significant addition, simulating tandem duplications and verifying the outcomes. However, based on previous comments, enhancing the method's documentation could further clarify the simulation steps, especially regarding the editing of tables to insert a tandem duplication. Additionally, including assertions to verify that the simulation results align with expected outcomes, such as genome sizes differing by multiples of the duplication size, would strengthen the test's validity.Given the previous feedback, consider enhancing the documentation and adding necessary assertions to ensure the method's clarity and effectiveness in simulating tandem duplications.
Should never have > 32767 chromosomes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (2)
- GeneticInheritanceGraphLibrary/tables.py (7 hunks)
- tests/test_tables.py (2 hunks)
Learnings (1)
tests/test_tables.py (1)
The `skip_validate` parameter's usage and implications have been clarified in the docstring of the `add_iedge_row` method in the `GeneticInheritanceGraphLibrary` project, as per the user's update.
Additional comments (7)
tests/test_tables.py (2)
- 392-404: The test method
test_add_iedge_allow_negative_edge_chrom
introduces testing for adding edges with negative chromosome numbers, which could represent circular chromosomes. This is a valuable addition to ensure the library can handle such cases. However, it's important to ensure that the rest of the library and any downstream code can correctly interpret and handle these negative values without unintended consequences.Verification successful
The shell script output indicates that the library does reference chromosome numbers, including in contexts that might involve negative values, as seen in the test case. However, the explicit handling of negative chromosome numbers, especially in the core library code (
GeneticInheritanceGraphLibrary/tables.py
), remains unclear. The mention of non-negative integers suggests a preference or assumption for non-negative values, but the implications of using negative values are not fully detailed.To clarify the handling and implications of negative chromosome numbers in the library, further examination of the
GeneticInheritanceGraphLibrary/tables.py
file is necessary, focusing on the implementation details related to chromosome numbers.
The examination of
tables.py
confirms that the library is designed to handle chromosome numbers flexibly, including the use of negative values, as indicated by the test case and comments within the code. The storage of chromosome numbers asnp.int16
and the explicit mention of negative chromosome numbers for circular chromosomes align with the test methodtest_add_iedge_allow_negative_edge_chrom
. This suggests that the library and its tests are prepared to handle such cases, supporting the review comment's validity regarding the introduction of tests for negative chromosome numbers.* 427-438: The method `test_add_iedge_row_fail_negative` correctly tests for negative values in various parameters, ensuring that the library enforces non-negative constraints where appropriate. This is crucial for maintaining data integrity and preventing errors in genomic calculations. It's well-implemented and covers a comprehensive range of cases.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for handling of negative chromosome numbers in the library rg --type py 'chromosome' | grep '-'Length of output: 576
Script:
#!/bin/bash # Examine the handling of chromosome numbers in tables.py, focusing on negative values fd tables.py --exec cat {} | grep 'chromosome'Length of output: 3948
GeneticInheritanceGraphLibrary/tables.py (5)
- 97-103: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [100-109]
The introduction of
_non_int64_fieldtypes
as a class attribute to specify field types other thanint64
is a good approach for flexibility and optimization. However, it's important to ensure that the data types chosen are compatible with the operations performed on these fields throughout the codebase.
- 298-301: Optimizing space by using
np.int16
forchild_chromosome
andparent_chromosome
inIEdgeTable
is a sensible choice given the context. However, ensure that the range ofnp.int16
is sufficient for all possible values these fields might hold in the application's domain.- 715-715: Using
np.float64
for thetime
field inNodeTable
andnp.uint32
forflags
is appropriate given the nature of the data. This change enhances the precision and range of thetime
values and ensures thatflags
can accommodate a wide range of binary flags.- 765-765: The decision to use
np.uint32
forflags
inIndividualTable
aligns with the need for a wide range of binary flags. This consistency in handling flags across different tables is good for maintainability.- 298-301: While optimizing space by using
np.int16
for chromosome fields is a good practice, it's essential to ensure that the maximum value that can be stored innp.int16
is not exceeded in practical use cases. If there's a possibility of exceeding this range, consider using a larger data type.
for i, val in enumerate( | ||
( | ||
edge, | ||
child_left, | ||
child_right, | ||
parent_left, | ||
parent_right, | ||
child, | ||
parent, | ||
child_chromosome, | ||
parent_chromosome, | ||
child_left, # 0 | ||
child_right, # 1 | ||
parent_left, # 2 | ||
parent_right, # 3 | ||
child, # 4 | ||
parent, # 5 | ||
child_chromosome, # 6 | ||
parent_chromosome, # 7 | ||
edge, # 8 | ||
) | ||
): | ||
if int(i) != i: | ||
if int(val) != val: | ||
raise ValueError("Iedge data must be integers") | ||
if is_edge != 0 and i < 0: | ||
if i < 6 and val < 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The validation logic in add_row
method of IEdgeTable
checks if the provided values are integers and non-negative (except for the edge ID). This is crucial for data integrity. However, consider extracting this validation logic into a separate method to improve readability and maintainability.
- if validate is None:
- validate = ~ValidFlags.IEDGES_ALL
- # only try validating if any IEDGES flags are set
- if (not skip_validate) and bool(validate & ValidFlags.IEDGES_ALL):
- # need to check the values before they were put into the data array,
- # as numpy silently converts floats to integers on assignment
- if validate & ValidFlags.IEDGES_INTEGERS:
- for i, val in enumerate(
- (
- child_left, # 0
- child_right, # 1
- parent_left, # 2
- parent_right, # 3
- child, # 4
- parent, # 5
- child_chromosome, # 6
- parent_chromosome, # 7
- edge, # 8
- )
- ):
- if int(val) != val:
- raise ValueError("Iedge data must be integers")
- if i < 6 and val < 0:
- raise ValueError(
- "Iedge data must be non-negative (except edge ID)"
- )
+ self._validate_iedge_data(validate, skip_validate, child_left, child_right, parent_left, parent_right, child, parent, child_chromosome, parent_chromosome, edge)
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
for i, val in enumerate( | |
( | |
edge, | |
child_left, | |
child_right, | |
parent_left, | |
parent_right, | |
child, | |
parent, | |
child_chromosome, | |
parent_chromosome, | |
child_left, # 0 | |
child_right, # 1 | |
parent_left, # 2 | |
parent_right, # 3 | |
child, # 4 | |
parent, # 5 | |
child_chromosome, # 6 | |
parent_chromosome, # 7 | |
edge, # 8 | |
) | |
): | |
if int(i) != i: | |
if int(val) != val: | |
raise ValueError("Iedge data must be integers") | |
if is_edge != 0 and i < 0: | |
if i < 6 and val < 0: |
WIP. You can test by running the following code:
For me, this takes about 10 mins, and results in a list of genome sizes that have lengths that differ by multiples of 450bp (the duplication size)
Summary by CodeRabbit
tqdm
package for progress tracking.