Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDAP-20884: Multi push operation #15385

Merged
merged 1 commit into from
Nov 9, 2023
Merged

CDAP-20884: Multi push operation #15385

merged 1 commit into from
Nov 9, 2023

Conversation

GnsP
Copy link
Contributor

@GnsP GnsP commented Oct 27, 2023

Implement Push LRO

  • Fetches applications one by one from db/api and writes it to local repo
  • The operation is atomic i.e the push happens after all app changes has been committed ( in one commit)

@samdgupi samdgupi changed the base branch from develop to CDAP-20862-helper October 27, 2023 07:35
@samdgupi samdgupi added 6.10 build Triggers github actions build labels Oct 27, 2023
@samdgupi samdgupi force-pushed the multi-push-operation branch from 8481aa5 to 2b6800b Compare October 27, 2023 07:47
@samdgupi samdgupi force-pushed the multi-push-operation branch 2 times, most recently from 9834df2 to a64a412 Compare October 27, 2023 18:50
appVersions.add(new AppVersion(detail.getName(), detail.getAppVersion()));
} catch (IOException | NotFoundException e) {
throw new SourceControlException(
String.format("Failed to fetch details for app %s", appRef));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always retain the original exception as the cause when rethrowing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

@@ -227,8 +228,8 @@ public static void validateConfig(final SecureStore secureStore,
* @throws SourceControlException when failed to get the fileHash before
* push
*/
public String commitAndPush(final CommitMeta commitMeta,
final Path fileChanged)
public Map<Path, String> commitAndPush(final CommitMeta commitMeta,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define a class for the result instead of Map<Path, String> to better represent what the result is and easier to include more information in future.

Map<Path, String> fileHashes = new HashMap<>();
for (Path fileChanged : filesChanged) {
try {
String fileHash = getFileHash(fileChanged, commit);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like should have a batch getFileHashes method to do it more efficiently by doing a single TreeWalk instead of scanning it multiple times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated. Added a new method getFileHashes to get the hashes using a single TreeWalk.

repositoryManager.cloneRemote();
} catch (GitAPIException | IOException e) {
throw new GitOperationException(String.format("Failed to clone remote repository: %s",
e.getMessage()), e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically when wrapping an exception, you don't use the e.getMessage() to form the wrapping message, otherwise it just result it duplicated message in the stacktrace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

Base automatically changed from CDAP-20862-helper to develop October 31, 2023 04:46
@GnsP GnsP force-pushed the multi-push-operation branch 4 times, most recently from 9db1dba to b970e5f Compare November 1, 2023 06:46
// update git metadata for the pushed application
applicationManager.updateSourceControlMeta(namespaceId, getUpdateMetaRequest(responses));
} catch (NotFoundException | BadRequestException | IOException | SourceControlException e) {
throw new OperationException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a problem in the existing PullAppsOperation too, but OperationException should take another exception as the cause. As it is now, the root cause will be lost.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the OperationException class constructor to take the cause too. Updated the usage in PullAppsOperation.

try (TreeWalk walk = new TreeWalk(git.getRepository())) {
walk.addTree(commit.getTree());
walk.setRecursive(true);
if (relativePaths.size() >= 2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: change to > 1 for consistency with rest of the project

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

String fileHash = walk.getObjectId(0).getName();
Path filePath = Paths.get(walk.getPathString());
if (fileHash == null) {
throw new SourceControlException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this actually happen? If so, when? Would be good to add a comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This exception should never happen, as the getName method on jgit.lib.ObjectId always returns a non-null string. Removed this check (unnecessary).

@@ -417,6 +422,39 @@ private String getFileHash(final Path relativePath, final RevCommit commit)
}
}

private Map<Path, String> getFileHashes(Set<Path> relativePaths, RevCommit commit)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a Path in the input doesn't actually exist, what is the expected behavior? Does it throw an exception? Is it just not returned in the Map? In the Map but with empty file hash? Should add a javadoc describing the behavior, and also when each type of exception is thrown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a Path in the input does not actually exist, then it's not included in the Map at all. Only the paths that are found in the filtered TreeWalk are added to the Map.

This method does not throw any exception in this case. This case is handled in the method that uses it (i.e. the commitAndPush method). If the returned map has different number of entries than expected (based on the input), then a SourceControlException is thrown.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add this information as a javadoc for the method.

walk.addTree(commit.getTree());
walk.setRecursive(true);
if (relativePaths.size() >= 2) {
walk.setFilter(OrTreeFilter.create(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the git library we're using, but I feel like this would likely be implemented in an inefficient way, where it does N comparisons for each element in the walk.

It would be more efficient and simpler (don't need the if-else) to create a PathFilter that just checks if the path is in the input set with Set.contains().

Copy link
Contributor Author

@GnsP GnsP Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the library implementation. Yes, it does N comparisons for each element in the walk.

Updated the method to filter the paths during the tree traversal based on their existence in the set of relative paths.

throws NoChangesToPushException {
private void cloneAndCreateBaseDir(RepositoryManager repositoryManager) {
try {
repositoryManager.cloneRemote();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this clone into the base directory? Confused why we need to create the directory later if it was already cloned here.

Copy link
Contributor Author

@GnsP GnsP Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The repository root directory (this gets cloned) and the scm base directory are different. The scm base directory is a directory inside the repository root, configurable per namespace. This base directory may not exist if the user is pushing apps from the namespace for the first time. That's why we need to create it, if it does not exist already.

) {
cloneAndCreateBaseDir(repositoryManager);

LOG.info("Pushing application configs for : {}", pushRequest.getApps());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be a really long log line, it's better to just log the number of apps that will be pushed. If we want to log each one, have a separate log statement within the loop below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

) throws NoChangesToPushException {
// get relative paths for apps to be pushed
// we pass a set as we should have list of unique paths.
Map<AppVersion, Path> appRelativePaths = appsToPush.stream().collect(Collectors.toMap(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this method, we're creating this Map, another gitFileHashes Map, and a new HashSet that's passed into commitAndPush(). That's a lot of extra passes through the data, a bit of extra memory, and extra mapping complexity that makes the method harder to understand. If the code needs to do a bunch of converting back and forth of collections, it's an indication that the APIs probably need improvement.

For example, this could all be done in one pass if we change commitAndPush() to look like:

public <T extends Supplier<Path>> void commitAndPush(Iterator<T> commitFiles, BiConsumer<T, String> consumer);

and call it something like:

private static class PushFile implements Supplier<Path> {
  private final AppVersion appVersion;
  private final Path path;

  public Path get() {
    return path;
  }
}

Iterator<PushFile> iter = Iterators.transform(appsToPush, appToPush -> new PushFile(appToPush, getFilePath(...));
repositoryManager.commitAndPush(iter, (pushFile, fileHash) -> response.add(new PushAppResponse(...)));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

writeAppDetail(repositoryManager, pushRequest.getApp());

// it should never be empty as in case of any error we will get an exception
return commitAndPush(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can change the entire method to call multiPush() and return the first response instead of just this section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing the push method to use multiPush internally does not simplify the implementation, as these 2 methods are logically different (by design).

In case of single application push method, the request object (of type PushAppOperationRequest) already contains the ApplicationDetail of the app to push.

However, the multiPush method takes a request containing a set of app names, and an ApplicationManager. The applicationManager fetches the application details when necessary.

@GnsP GnsP force-pushed the multi-push-operation branch 2 times, most recently from 43c5371 to a5359b7 Compare November 2, 2023 07:55
Copy link
Contributor

@albertshau albertshau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments

responses = scmOpRunner.multiPush(pushReq, applicationManager);
context.updateOperationResources(getResources(namespaceId, responses));
} catch (SourceControlException | NoChangesToPushException e) {
throw new OperationException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should pass e as the cause. When wrapping an exception we will basically always want to set the cause if there is one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -417,6 +422,39 @@ private String getFileHash(final Path relativePath, final RevCommit commit)
}
}

private Map<Path, String> getFileHashes(Set<Path> relativePaths, RevCommit commit)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add this information as a javadoc for the method.

while (walk.next()) {
if (walk.isSubtree()) {
walk.enterSubtree();
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a little more readable to reduce one layer of nesting like:

if (walk.isSubtree()) {
  ...
  continue;
}
Path filePath...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* operations
*/
//TODO: CDAP-20371, Add retry logic here in case the head at remote moved while we are doing push
private <T extends Supplier<Path>> void commitAndPush(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the previous comment was about changing the RepositoryManager.commitAndPush() method to take the Iterator and BiConsumer. That way this method doesn't need to create the extra collections like the current pathMapping Map.

Copy link
Contributor Author

@GnsP GnsP Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RepositoryManager.commitAndPush needs to construct a set of paths from the iterator anyway, as the set is necessary to efficiently filter the files in the TreeWalk. Also, it makes sense for the RepositoryManager.commitAndPush to return a Map of filepaths to hashes (as it's one operation, that adds multiple files to a commit and pushes) instead of taking a consumer to pass the results one by one. Therefore, updating this method did not improve readability much.

However, the suggested change made perfect sense for the commitAndPush method in the runner, as it simplified the implementation of this method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original comment was mostly about avoiding all of these extra collections, as they add time, memory, and complexity. I don't think it's actually good that commitAndPush returns a Map, as it almost always forces the caller to eventually transform it into some other type of collection.

If we want it to return information about what it committed instead of passing in a consumer, a better approach would be for it to return a Collection<O> and to take a BiFunction<T, String, O>, where O would be a PushAppResponse in this class. I think the current implementation of commitAndPush would make that difficult, but see the other comment I made about getting the file hash right when adding the file instead of walking through the tree after committing.

@GnsP GnsP force-pushed the multi-push-operation branch from a5359b7 to 7239fe1 Compare November 2, 2023 20:57
@@ -240,24 +241,24 @@ public String commitAndPush(final CommitMeta commitMeta,
"No changes have been made for the applications to push.");
}

git.add().addFilepattern(fileChanged.toString()).call();
for (Path fileChanged : filesChanged) {
git.add().addFilepattern(fileChanged.toString()).call();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to get the file hash from whatever this returns? I see there is a DirCache.getEntry().getObjectId(), but not sure if that's the same as what's happening in getFileHashes.

If so it would be better than doing the tree walk later, and would allow passing in an iterator of paths instead of the Set.

Copy link
Contributor

@samdgupi samdgupi Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked in test and looks like the hashes are the same. We can skip the tree walk. Thanks for the suggestion @albertshau
@GnsP Please do a reverify before making the changes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the commitandpush take a BiFunction<T, String, O> and return Collection<O>. Updated all the test accordingly

* operations
*/
//TODO: CDAP-20371, Add retry logic here in case the head at remote moved while we are doing push
private <T extends Supplier<Path>> void commitAndPush(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original comment was mostly about avoiding all of these extra collections, as they add time, memory, and complexity. I don't think it's actually good that commitAndPush returns a Map, as it almost always forces the caller to eventually transform it into some other type of collection.

If we want it to return information about what it committed instead of passing in a consumer, a better approach would be for it to return a Collection<O> and to take a BiFunction<T, String, O>, where O would be a PushAppResponse in this class. I think the current implementation of commitAndPush would make that difficult, but see the other comment I made about getting the file hash right when adding the file instead of walking through the tree after committing.

@samdgupi samdgupi force-pushed the multi-push-operation branch from 7239fe1 to cae5edd Compare November 8, 2023 19:54
@samdgupi samdgupi requested a review from albertshau November 8, 2023 19:56
public String commitAndPush(final CommitMeta commitMeta,
final Path fileChanged)
public <T, S extends Supplier<Path>> Collection<T> commitAndPush(CommitMeta commitMeta, Collection<S> filesChanged,
BiFunction<S, String, T> hashConsumer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing doc for hashConsumer and the @return

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the javadoc

Copy link
Contributor

@albertshau albertshau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment, otherwise lgtm

throw new GitOperationException(
String.format("Failed to get fileHash for %s", fileChanged),
e);
List<T> output = new LinkedList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should use an ArrayList. It basically always performs better than a LinkedList, even if it needs to get resized underneath. This is especially true when we know the size beforehand (like we do here) and can pass it in to the constructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@GnsP GnsP force-pushed the multi-push-operation branch 2 times, most recently from 5052709 to 8ec7c41 Compare November 8, 2023 23:38
) {
cloneAndCreateBaseDir(repositoryManager);

LOG.info("Pushing application configs for {} apps.", pushRequest.getApps().size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug log

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

List<PushFile> filesToPush = new LinkedList<>();

for (String appToPush : pushRequest.getApps()) {
LOG.info("Pushing application configs for : {}.", appToPush);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trace log

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

ApplicationReference appRef = new ApplicationReference(pushRequest.getNamespaceId(),
appToPush);
try {
ApplicationDetail detail = appManager.get(appRef);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop would be very slow? It is getting one app at a time from the worker remotely? We should add a batch method to the manager.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this operation uses only the LocalApplicationManager, which gets the ApplicationDetail in memory. The RemoteApplicationManager will be used here, only when the corresponding LongRunningOperationRunner is implemented. @samdgupi can provide more information on it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s why the implementation here is not safe. We should have a batch fetch method

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK We currently have no API/ which can fetch details by list of ids. We had plan to do this optimization after initial release of the feature. I will add this a TODO and take it up if possible as part of current release.

@GnsP GnsP force-pushed the multi-push-operation branch from 8ec7c41 to 21bfc91 Compare November 9, 2023 00:49
@samdgupi samdgupi merged commit 4b8c345 into develop Nov 9, 2023
5 checks passed
@samdgupi samdgupi deleted the multi-push-operation branch November 9, 2023 06:47
@samdgupi samdgupi changed the title Multi push operation CDAP-20884-Multi push operation Nov 9, 2023
@samdgupi samdgupi changed the title CDAP-20884-Multi push operation CDAP-20884: Multi push operation Nov 9, 2023
samdgupi added a commit that referenced this pull request Nov 9, 2023
samdgupi added a commit that referenced this pull request Nov 9, 2023
samdgupi added a commit that referenced this pull request Nov 9, 2023
samdgupi added a commit that referenced this pull request Nov 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.10 build Triggers github actions build
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants