Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error trying to open a tab-separated file #96

Closed
prvst opened this issue Oct 14, 2021 · 10 comments
Closed

Error trying to open a tab-separated file #96

prvst opened this issue Oct 14, 2021 · 10 comments
Labels
bug Something isn't working hacktoberfest-accepted hacktoberfest-accepted help wanted Extra attention is needed

Comments

@prvst
Copy link

prvst commented Oct 14, 2021

Hi, I'm trying to visualize a TSV file, and I'm getting this error:

thread 'main' panicked at 'a csv record: Error(UnequalLengths { pos: Some(Position { byte: 513, line: 2, record: 1 }), expected_len: 37, len: 8 })', src/main.rs:312:20
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I'm using the latest Linux release.

@alexhallam
Copy link
Owner

Thanks! Could you paste the command you used and a small (1-2 line) csv so I can reproduce the problem?

@prvst
Copy link
Author

prvst commented Oct 14, 2021

Sure!
This is the command: tidy-viewer -s \t test.csv. The file has CSV in the name, but it's using tabs as delimiters.
test.csv

@alexhallam
Copy link
Owner

Perfect! I will work on this.

@alexhallam alexhallam added bug Something isn't working good first issue Good for newcomers hacktoberfest-accepted hacktoberfest-accepted labels Oct 14, 2021
@alexhallam alexhallam self-assigned this Oct 14, 2021
@alexhallam alexhallam added hacktoberfest-accepted hacktoberfest-accepted and removed hacktoberfest-accepted hacktoberfest-accepted good first issue Good for newcomers labels Oct 14, 2021
@alexhallam alexhallam removed their assignment Oct 14, 2021
@alexhallam
Copy link
Owner

@Lireer do you mind if I pick your brain a little? It turns out that this issue is a little more difficult than I first thought. The csv crate is aware of \t, but tv is not parsing the argument "\t" properly.

There are two problems

  1. I need to modify the match case to accept more than 1 byte delimiters. This will not be a problem.
  2. I need to somehow evaluate \t as oppose to reading it as a literal string. This is the tricky part. I was curious if you were aware of an elegant way to handle this.

For part 1 the relevant code is here:

tv/src/datatype.rs

Lines 227 to 237 in 1206937

pub fn parse_delimiter(src: &str) -> Result<u8, String> {
let bytes = src.as_bytes();
match bytes.len() {
1 => Ok(bytes[0]),
_ => Err(format!(
"expected one byte as a delimiter, got {} bytes (\"{}\")",
bytes.len(),
src
)),
}
}

For part 2 the relevant code is here:

tv/src/main.rs

Lines 573 to 587 in 1206937

fn build_reader(opt: &Cli) -> Result<Reader<Box<dyn Read>>, std::io::Error> {
let source: Box<dyn Read> = if let Some(path) = opt.file.clone() {
let file = File::open(path)?;
Box::new(BufReader::new(file))
} else {
Box::new(io::stdin())
};
let reader = ReaderBuilder::new()
.has_headers(false)
.delimiter(opt.delimiter)
.from_reader(source);
Ok(reader)
}

@Lireer
Copy link
Contributor

Lireer commented Oct 14, 2021

I don't think the problem lies with the parsing of the delimiter, but with the way the TAB is entered into the shell. The correct way to enter the byte for a TAB 0x09 depends on the shell. What we actually got from the shell (bash) was just the character t (0x74). Reading and printing work without any problems if an actual TAB character is received from the shell.

The working solutions for some shells:

  • bash: $'\t'
  • sh: "$(printf "\t")"
  • fish: \t
  • powershell: "`t" (the backtick interferes with the markdown formatting)

We could check the special case of the delimiter being the string \t (0x5c74).

Some more things I tried and what we receive as a delimiter:

input bash sh fish powershell
\t t (0x74) t (0x74) TAB (0x09) -
'\t' \t (0x5c74) \t (0x5c74) \t (0x5c74) -
"\t" \t (0x5c74) \t (0x5c74) \t (0x5c74) -
\\t \t (0x5c74) \t (0x5c74) \t (0x5c74) -
$'\t' TAB (0x09) $\t (0x245c74) - -
"`t" - - - TAB (0x09)

@alexhallam
Copy link
Owner

I really like the idea of checking for the special case \t (0x5c74) . I was naturally trying to input "\t" and '\t'

@alexhallam
Copy link
Owner

alexhallam commented Oct 16, 2021

Here is another thought. If the file extension is *.tsv then automatically use the tab delimiter. I like this because it solves the issue upstream. What are your thoughts @Lireer and @prvst.

I am fine being opinionated. As long as it is documented that tv views csvs as files that have single byte delimiters and tsvs as files that are tab separated then I think this is a good solution.

Also, according to the above table \t is not something that returns a hex on powershell. It may be that *.tsv is the most accessible.

@Lireer
Copy link
Contributor

Lireer commented Oct 16, 2021

Seems like a good idea to me as long as the effects of different file extensions are documented and setting the delimiter via cli options is easier than changing a files extension. So in my opinion we should check for the extension and parse values like \t. Are there other values that should be accepted (see ANSI-C Quoting) and where do we draw the line of what is expected to work and what isn't. Maybe just start with the extension check and \t, others can be added later and until then the user has to figure out how their shell works.

Some thoughts on the design of the API:

Implementing any of this shouldn't be a problem, I'm just wondering what kind of behavior and features do users expect and use. Would they expect a .tsv file to automatically use tabs as delimiters? I guess the answer to that is likely yes, but I don't know. They shouldn't feel the need to find some hacky solution, e.g. the shell makes entering the tab byte hard, so they change the extension of their file to be recognized as TSV.

In general these options configure the reader using different ways. This means there should be an intuitive order of priority. I feel like it's quite easy to make automatic features unintuitive, so we should make sure they are clearly documented. For me the expected behavior would be:

  1. Standard csv configuration as the default
  2. Maybe a config file
  3. Check the file extension for TSV or PSV (pipe-separated values)
    • Are there more formats and what do they change? Just the delimiter?
  4. CLI options should take precedence over everything else

@alexhallam
Copy link
Owner

I agree with making the CLI options take precedence. In my experience the most common exception to the single byte delimiters like ,, :, ;, |, etc. is \t and that is probably the only exception that makes sense. I am fine ignoring the many other ANSI-C quoting special characters.

So lets just allow users to pass -s "\t" for tab separated files.

@alexhallam alexhallam added the help wanted Extra attention is needed label Oct 17, 2021
@alexhallam
Copy link
Owner

Close with #99 by @Lireer, unfortunately I believe this file is malformed in addition to being a tsv. How to handle these problems is an open issue #79

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working hacktoberfest-accepted hacktoberfest-accepted help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants