Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Field with any newline character is truncated #95

Open
Adnan-cds opened this issue Mar 12, 2021 · 8 comments
Open

Field with any newline character is truncated #95

Adnan-cds opened this issue Mar 12, 2021 · 8 comments

Comments

@Adnan-cds
Copy link

Hello, I was trying to create a CSV file where one of the fields lists multiple URLs. For peoples' viewing pleasure, I thought I will separate those URLs using "\n". Kind of like this:

"Source page node id", "Broken Links in this page"
42, "https://example.net/foo
https://example.net/bar
https://example.net/bar"
162, "https://example.net/qux"

But the output from drush insert-command-here --format=csv kept producing a truncated version like the following:

"Source page node id", "Broken Links in this page"
42, "https://example.net/foo
162, "https://example.net/qux"

Turned out the CSV formatter is keeping the first line of each CSV row only:

$csv = fgets($buffer);

Fix is easy (patch pasted below). I am happy to raise a pull request, but I am wondering if the current behaviour is intentional or not. Please advice.

--- /path/to/drupal/vendor/consolidation/output-formatters/src/Formatters/CsvFormatter.php     2020-10-11 05:15:32.000000000 +0100
+++ CsvFormatter.php    2021-03-12 20:03:39.327122512 +0000
@@ -123,9 +123,11 @@
         } else {
             fputcsv($buffer, $data, $delimiter, $enclosure);
         }
+        $buffer_length = ftell($buffer);
         rewind($buffer);
-        $csv = fgets($buffer);
+        $csv = fread($buffer, $buffer_length);
         fclose($buffer);
+
         return $csv;
     }
 }
@greg-1-anderson
Copy link
Member

The current implementation of the csv formatter was not intended to support fields with embedded line breaks; however, I would not be opposed to a pull request to allow for this capability, provided that it included at least one test.

@donquixote
Copy link

donquixote commented Jul 13, 2023

From previous experience with csv in php:
The php csv implementation is a bit silly in it using an "escape char".
The more common format does not use an escape char, instead you would use " as the enclosure, and then double "" is a literal ".
This is more sane, more common, and will be understood by most programs.
I think it is not a real standard but an RFC that most people follow, except php.

This would do the trick, for comma as delimiter - but obviously you want to support different values for the parameters.

    protected function csvEscape($data, $delimiter = ',', $enclosure = '"', $escapeChar = "\\")
    {
        return implode(',', array_map(static function (string $value): string {
          if (!preg_match('@["\n,]@', $value)) {
            return $value;
          }
          return '"' . str_replace('"', '""', $value) . '"';
        }, $data)) . "\n";
      }

EDIT: A "\n" was missing.

@greg-1-anderson
Copy link
Member

The definition of the csv data type in the output formatters project is that each record appears on a separate line; therefore, if embedded newlines are to be supported by this format, they would need to be converted to some form of escape character.

Your program could install a custom formatter, or replace the csv formatter with a custom implementation that behaves the way that you want, but we can't add embedded newlines to the csv formatter provided here.

@donquixote
Copy link

The definition of the csv data type in the output formatters project is that each record appears on a separate line;

May I ask why this definition exists in this way and is such a hard requirement?
Is this mainly for BC, or is there a deeper reason?

If BC is a concern, we can look for ways around that.

therefore, if embedded newlines are to be supported by this format, they would need to be converted to some form of escape character.

This would not be understood by typical programs that can read csv, e.g. spreadsheet applications.

@greg-1-anderson
Copy link
Member

Yes, it would require a major version to change the behavior. As for ways around it, perhaps a property could stipulate behavior, but it's already possible to install your own formatter, or replace a standard one if you'd prefer different behavior in your application.

@donquixote
Copy link

Would you accept an alternative csv formatter as part of this package?
How would we name it? csv-rfc-4180? Or csv-multiline?
https://www.rfc-editor.org/rfc/rfc4180
https://www.ietf.org/rfc/rfc4180.txt (same thing)

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

https://stackoverflow.com/questions/10140999/csv-with-comma-or-semicolon

Unfortunately, RFC 4180 is a request for comments, NOT a standard. It says right at the top -- "does not specify an Internet standard of any kind." Later, it says that RFC 4180 "documents the format that seems to be followed by most implementations."

(this is in one of the comments)

So, not a real standard, but still better than what php produces with native fputcsv().

I think the remaining problem with office applications is character encoding, which is independent from the escape char and enclosure syntax.

@donquixote
Copy link

it's already possible to install your own formatter

Actually, how?
I see FormatterManager->addFormatter(), and I see how drush uses this to add a new HelpCLIFormatter(). But it does that within a specific command only ("help").

For context: Atm I am working with openeuropa/task-runner. Other times I work with drush. Atm I would want the formatter to be globally available in a website project. Other times I might want to ship the formatter with a shared package or module.

@donquixote
Copy link

Actually, the inconsistent handling in php has been fixed, it seems :)

In the docs for all the csv functions we find this:

escape
The optional escape parameter sets the escape character (at most one single-byte character). An empty string ("") disables the proprietary escape mechanism.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants