Still problems writing long strings to sav files #346

Closed
sjkiss opened this issue Feb 15, 2018 · 10 comments
Labels
bug (an unexpected problem or unintended behavior), readstat

Comments

@sjkiss

sjkiss commented Feb 15, 2018

I am still having problems with this. I am sure I am using the development version of haven.

There are quite a few variables in this dataset that have more than 255 characters. The following script 1) downloads the file I am working with, 2) writes it out via haven, 3) keeps only the variables whose values are shorter than 255 characters, and 4) writes that data set out for comparison.
Note: it requires the dataverse package to download the file and sets two environment variables. Sorry about that; I can't figure out how to make this work without it.

#setwd()
#If necessary, install dataverse
#install.packages('dataverse')
#Load dataverse
library(dataverse)
#Warning: This changes two environment variables that are necessary to search dataverse with the dataverse package; I haven't been able to write this script where these are only changed locally. It's not a huge deal, but be aware.
Sys.setenv("DATAVERSE_SERVER" = "dataverse.scholarsportal.info")
Sys.setenv("DATAVERSE_KEY" = "e66bfc71-7665-40bf-83c2-b7e5a6dc2c33")
#Get the problematic file as a binary file (I think)
out<-get_file('second_survey.tab', 'hdl:10864/10985', 'original')
#> Warning in strptime(x, fmt, tz = "GMT"): unknown timezone 'zone/tz/2018c.
#> 1.0/zoneinfo/America/Toronto'
#> Warning in strptime(x, fmt, tz = "GMT"): unknown timezone 'zone/tz/2018c.
#> 1.0/zoneinfo/America/Toronto'
#Write it out in SPSS format
writeBin(out, 'out.sav')
#AFAIK I have the development version of haven installed
library(haven)
#Read the sav file in
read_out<-read_sav(out)
#Now Write it out and test
write_sav(read_out, 'write_out.sav')
#Two variables that cause problems are
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.4.2
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
read_out %>%
select(contains('storetime'))
#> # A tibble: 5,309 x 2
#> storetime R_storetime
#>
#> 1 BrowserName_Capture:NaN|QINF0:1|INT… ""
#> 2 BrowserName_Capture:NaN|QINF0:1|INT… ""
#> 3 BrowserName_Capture:NaN|QINF0:3|INT… ""
#> 4 BrowserName_Capture:NaN|QINF0:2|INT… BrowserName_Capture:NaN|QINF0:2|I…
#> 5 "" BrowserName_Capture:NaN|QINF0:1|I…
#> 6 BrowserName_Capture:NaN|QINF0:1|INT… ""
#> 7 "" BrowserName_Capture:NaN|QINF0:2|I…
#> 8 BrowserName_Capture:NaN|QINF0:1|INT… ""
#> 9 BrowserName_Capture:NaN|QINF0:1|INT… ""
#> 10 BrowserName_Capture:NaN|QINF0:0|INT… BrowserName_Capture:NaN|QINF0:1|I…
#> # ... with 5,299 more rows
#When I delete all variables that have string values longer than 255 characters, the sav file that is produced is fine.
read_out %>%
  select(which(apply(., 2, function(x) max(nchar(x, keepNA = FALSE))) < 255)) %>%
  write_sav(., 'write_out_subset_less_than_255.sav')
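The filter above relies on apply(), which coerces the whole data frame to a character matrix before measuring it. A minimal sketch of a more explicit check in base R follows. It assumes the relevant threshold is 255 bytes per value (taken from this thread, not verified against the SPSS file format), and `long_string_cols` is a hypothetical helper, not part of haven.

```r
# Hypothetical helper (not part of haven): names of the character columns
# whose longest value exceeds `limit` bytes. The 255 default is an assumption.
long_string_cols <- function(df, limit = 255L) {
  is_chr <- vapply(df, is.character, logical(1))
  widths <- vapply(
    df[is_chr],
    function(x) {
      n <- nchar(x, type = "bytes", keepNA = FALSE)
      if (length(n) == 0L) 0L else max(n)
    },
    integer(1)
  )
  names(widths)[widths > limit]
}

# Generated data, so nothing has to be downloaded.
demo <- data.frame(
  short = "abc",
  long = strrep("a", 256),
  stringsAsFactors = FALSE
)
long_string_cols(demo)
#> [1] "long"
```

Running it on `read_out` should list the candidate columns (such as the two storetime variables) without dropping anything from the data.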

@sjkiss
Author

sjkiss commented Feb 15, 2018

I hope I did that right.

@hadley
Member

hadley commented Feb 15, 2018

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run it in a local session.

@hadley added the reprex (needs a minimal reproducible example) label Feb 15, 2018
@hadley
Member

hadley commented Feb 15, 2018

This seems fine to me currently:

library(haven)

long <- paste0(rep(letters, 100), collapse = "")
df <- data.frame(x = long, stringsAsFactors = FALSE)

path <- tempfile()
write_sav(df, path)

df2 <- read_sav(path)
df2$x == long
#> [1] TRUE

Could you please try and create a reprex in that style? (i.e. generating the problematic data rather than downloading from elsewhere)

@rubenarslan
Contributor

Sorry, didn't realise a new issue had been opened. But basically this still seems to be #266. The problem is not round-tripping with haven, but that SPSS doesn't open the file.

I think it's still the same problem you can also see with the minimal reprex. This is after installing the latest haven from master. You don't have SPSS to test, right?

library(haven)
n <- 256
df <- data.frame(long = paste(rep("a", n), collapse = ""), stringsAsFactors = FALSE)
write_sav(df, path = "test.sav")

Error. Command name: GET FILE
Invalid SPSS Statistics data file: test.sav (DATA1204)
Execution of this command stops.
Error # 1405 in column 8. Text: test.sav
Error when attempting to get a data file.
GET
FILE='test.sav'.

@hadley
Member

hadley commented Feb 16, 2018

@rubenarslan thanks! It wasn't clear that this was the problem. Can you please confirm that you're using the latest development version of haven (i.e. you installed in the last 12 hours)?

@hadley added the readstat and bug (an unexpected problem or unintended behavior) labels and removed the reprex (needs a minimal reproducible example) label Feb 16, 2018
@rubenarslan
Contributor

Yes!

devtools::install_github("hadley/haven")
Skipping install of 'haven' from a github remote, the SHA1 (cef5421) has not changed since last install.
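For anyone who wants to confirm which commit is installed without re-running the installer, here is a minimal base R sketch. It assumes the package was installed from GitHub with devtools or remotes, which record the commit in a `RemoteSha` field of the package DESCRIPTION. `installed_sha` is a hypothetical helper name.

```r
# Hypothetical helper: commit SHA recorded at install time, or NA when the
# package is missing or was not installed from a remote such as GitHub.
installed_sha <- function(pkg) {
  desc <- suppressWarnings(utils::packageDescription(pkg))
  if (!inherits(desc, "packageDescription")) return(NA_character_)
  sha <- desc$RemoteSha
  if (is.null(sha)) NA_character_ else sha
}

# Base packages are not installed from a remote, so this returns NA.
installed_sha("stats")
#> [1] NA
```

Calling `installed_sha("haven")` on a GitHub install should return the full SHA, whose first characters can be compared with the short hash printed by the installer.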

@hadley
Member

hadley commented Feb 16, 2018

@evanmiller looks like another one for you

@rubenarslan it might be useful for you to create a .sav file in SPSS containing exactly the same long string so we can see what's different.
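One rough way to see what differs between a haven-written file and an SPSS-written one is to compare the raw bytes. The sketch below uses only base R. `first_diff_offset` is a hypothetical helper, and the demo writes two tiny stand-in files rather than real .sav files. A byte-level diff is only a starting point, since two valid files can differ in timestamps or compression.

```r
# Hypothetical helper: 1-based offset of the first differing byte between
# two files, or NA if the files are byte-for-byte identical.
first_diff_offset <- function(path_a, path_b) {
  a <- readBin(path_a, what = "raw", n = file.size(path_a))
  b <- readBin(path_b, what = "raw", n = file.size(path_b))
  n <- min(length(a), length(b))
  mismatch <- which(a[seq_len(n)] != b[seq_len(n)])
  if (length(mismatch) > 0L) return(mismatch[1])
  # Same prefix but different lengths: the difference starts after the prefix.
  if (length(a) != length(b)) return(n + 1L)
  NA_integer_
}

# Stand-in files for illustration only.
f1 <- tempfile()
f2 <- tempfile()
writeBin(as.raw(c(1, 2, 3, 4)), f1)
writeBin(as.raw(c(1, 2, 9, 4)), f2)
first_diff_offset(f1, f2)
#> [1] 3
```

With real files, the two paths would be the haven output and the file saved from SPSS containing the same long string.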

@rubenarslan
Contributor

I uploaded those files back in #266. Maybe just reopen the issue?

@hadley
Member

hadley commented Feb 16, 2018

Good idea - I'll clean up the discussion, which got sidetracked there.

@lock

lock bot commented Aug 15, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Aug 15, 2018
3 participants