Virtual variables in ReadStat-created SAV files with long strings are visible un-merged in SPSS #122

Closed
rubenarslan opened this issue Feb 16, 2018 · 33 comments
@rubenarslan

Continuing from tidyverse/haven#266:
Although the latest fix works for 256 character variables and reading the file in SPSS, the foreign R package cannot read the file. Also, once we go to 512 characters, or 1024 characters, we start seeing the "virtual variables" (I presume) in SPSS rather than one long string.
I'll try to build readstat tomorrow and create a reproducible example from CSVs, Friday night over here now.

@evanmiller
Contributor

Thanks. I think some of the problems will be fixed here:

39b4a48

If problems persist, please open a separate issue for each distinct problem that you are encountering.

@rubenarslan rubenarslan changed the title ReadStat-created SAV files with long strings cannot be read by SPSS and foreign package Virtual variables in ReadStat-created SAV files with long strings are visible un-merged in SPSS Feb 17, 2018
@rubenarslan
Author

rubenarslan commented Feb 17, 2018

Thank you. The foreign package now reads the files properly (I didn't mean this as a separate problem but rather as a way to test without SPSS). Apparently foreign (unlike ReadStat) cannot automatically turn long strings in virtual variables into one variable (it throws a warning about this). But the import looks the same now, no matter whether the file was generated in SPSS or haven (except that SPSS-generated virtual vars have the name LONG0-9 while ReadStat does V0000001-9; maybe that is it?).

I now built the latest ReadStat. I then generated a 2560-char string in SPSS (test_2560_SPSS.sav in the zip). SPSS does not show the virtual variables for that file.
When I run readstat test_2560_SPSS.sav test_2560_rs.sav, though, SPSS does show the virtual variables in the output file (and the long variable isn't long).

I also attached a 2560char file generated through haven, it looks the same in SPSS. HTH.

comparison_of_files.zip
image

@evanmiller
Contributor

Thanks. This additional debugging information is helpful. I've made some more internal changes, including reporting the SPSS version number as 20 (same as the file you provided):

11e1702

It's possible that one of these changes will trigger the column merge within SPSS. If not, I'll tinker with the virtual variable names.

@rubenarslan
Author

That didn't do it unfortunately, the result is unchanged. Yeah, maybe try the variable names.
(Assuming I rebuilt correctly by pulling and re-running the commands from the readme)

@evanmiller
Contributor

Thanks for testing. It would help me if you create a similar file with a variable name that is 8 characters long - I'm curious how SPSS handles the virtual variable numbering. (Internally, SPSS variable names are limited to 8 characters.)

@rubenarslan
Author

rubenarslan commented Feb 18, 2018

Here, I made one with 8 and one with 12 char var names, and three with two/three/five vars that have the same first 8 chars. Apparently (checked with foreign) they take the first five chars of the var name, and if those are duplicates, they start with the last variable with digits, then switch to letters once those are exhausted, then letters with digits. Ugh.

verylong.zip
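
The "first five chars" observation suggests a quick check for which variables would end up sharing a stem. A minimal Python sketch, assuming the stem is simply the first five characters of the name, uppercased; the function name `stem_clashes` and the example names are hypothetical, and the digit/letter tie-breaking described above is not modelled:

```python
from collections import defaultdict

def stem_clashes(names, stem_len=5):
    """Group variable names by their uppercased stem.

    Assumption: the stem is the first `stem_len` characters of the name.
    Only stems shared by more than one name are returned.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name[:stem_len].upper()].append(name)
    return {stem: members for stem, members in groups.items() if len(members) > 1}

# Hypothetical variable names, for illustration only.
print(stem_clashes(["verylong1", "verylong2", "age"]))
```

Any stem reported here is one where SPSS would have to fall back on its clash-resolution scheme.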

@evanmiller
Contributor

Thanks for the research. I sincerely hope all these letter-digit acrobatics aren't necessary.

One last request (I hope): Can you make a file with variables called var0 and var1? That's the current ReadStat naming convention... I'm curious if the virtual variables are VAR00, VAR01, or something else.

@rubenarslan
Author

Sure. The names are

names(xx)

[1] "var0" "VAR00" "VAR01" "VAR02" "VAR03" "VAR04" "VAR05" "VAR06" "VAR07" "VAR08" "VAR09" "var1" "VAR10" "VAR11" "VAR12"
[16] "VAR13" "VAR14" "VAR15" "VAR16" "VAR17" "VAR18" "VAR19"

@evanmiller
Contributor

Thanks - for 10+ virtual variables does it wrap around to letters? Tbh I'm okay just supporting 2560-character variables to start, if we can get this working.

@rubenarslan
Author

it does (3000chars)

[1] "var0" "VAR00" "VAR01" "VAR02" "VAR03" "VAR04" "VAR05" "VAR06" "VAR07" "VAR08" "VAR09" "VAR0A" "var1" "VAR10" "VAR11"
[16] "VAR12" "VAR13" "VAR14" "VAR15" "VAR16" "VAR17" "VAR18" "VAR19" "VAR1A"
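
The listing above is consistent with a stem of up to five uppercase characters followed by one base-36 digit (0-9, then A-Z). A minimal Python sketch of that reading; the function name `virtual_names` is hypothetical, and the rule is inferred only from the names reported in this thread, not from SPSS documentation or ReadStat's source:

```python
import string

# Digits 0-9 followed by A-Z: the 36 symbols of one base-36 digit.
BASE36 = string.digits + string.ascii_uppercase

def virtual_names(name, count):
    """Guess the virtual-variable names for one long-string variable.

    Assumption: the stem is the first five characters of the variable
    name, uppercased, and each virtual variable appends a single
    base-36 digit to it.
    """
    if count > len(BASE36):
        raise ValueError("a single suffix character covers at most 36 names")
    stem = name[:5].upper()
    return [stem + BASE36[i] for i in range(count)]

print(virtual_names("var0", 11))
```

For `var0` with 11 virtual variables this yields VAR00 through VAR09 followed by VAR0A, matching the 3000-character case reported above.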

@evanmiller
Contributor

Thanks. Try this:

bee6e4d

If that works for 0-9, I'll look into doing the letter wraparound thing. It shouldn't be too hard, I'm guessing they just use base-36.

@rubenarslan
Author

Sorry, it doesn't work, but you're still not using the variable name stem, but the generic VAR00000, right? Or I'm not rebuilding right?

@evanmiller
Contributor

Hi, I'm now giving the variables 5-character names:

V0000
V0001

Then the virtual variables use this stem:

V00000
V00001
...
V00010
V00011
...

If this strategy isn't working, then it might be something else that's preventing SPSS from doing the merge.

@rubenarslan
Author

Ah, okay. Well SPSS seems to derive the 5-char names from shortening the visible names, which you don't do. Truthfully, I have no idea if this is what prevents the merge.

@evanmiller
Contributor

Try naming your variables v0000 and v0001. Then the naming convention should match that of SPSS. If that still doesn't work, it's probably something else.

@rubenarslan
Author

Yay! When I name the variable long v0000, it works. At long last ;-)

@evanmiller
Contributor

Okay, that is great to know!

Overall, the SPSS naming algorithm seems pretty complicated, so for now I will provide just enough support that you will be able to work around the limitations of both SPSS and ReadStat. I'd like to support more than the 10,000 columns implied by the v0000 format, but for now I'll live with that limitation.

The virtual variables will then use the SPSS convention of a base-36 suffix. To start I'll just support a single suffix character (e.g. v00000 ... v0000Z), so you'll be able to write string variables up to 255 * 36 = 9,180 characters in length.
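
The arithmetic behind that limit can be sketched in a few lines of Python. The 255-character segment width and the base of 36 are the figures quoted in this comment; the function names are hypothetical, and the actual on-disk segment size may differ:

```python
import math

SEGMENT_WIDTH = 255  # characters per segment, the figure used in this thread
BASE = 36            # suffix symbols 0-9 and A-Z

def max_string_length(suffix_chars=1):
    """Longest string reachable with the given number of suffix characters."""
    return SEGMENT_WIDTH * BASE ** suffix_chars

def segments_needed(length):
    """Number of 255-character segments needed to hold `length` characters."""
    return math.ceil(length / SEGMENT_WIDTH)

print(max_string_length())    # 9180
print(segments_needed(2560))  # 11
```

Under the same assumptions, allowing a second suffix character would multiply the limit by another factor of 36.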

@rubenarslan
Author

Thanks. Luckily for me, that's above the length of the longest string in my data, which is what made me raise the issue in the first place.

@evanmiller
Contributor

Try this:

abc976d

@rubenarslan
Author

That's not it.. (this is from the command readstat test_2560_SPSS.sav test_2560_rs6.sav).
I think you really have to match the original var name exactly.
image

@evanmiller
Contributor

Okay, I am wondering if SPSS makes an exception for the format V2_A, V3_A, etc., since that's the format it uses in the event of name clashes.

Try this:

0817838

@evanmiller
Contributor

Hang on, need to use 1-indexing instead of 0-indexing.

@rubenarslan
Author

ok, because this also didn't work.

@evanmiller
Contributor

Ok, try this:

fc1986c

If that doesn't work I'll try to implement the complete SPSS algorithm.

@rubenarslan
Author

It doesn't work, sorry.

@evanmiller
Contributor

Thanks. The full algorithm is complicated so I'm afraid it'll have to wait. I'll leave this issue open though.

@rubenarslan
Author

No urgency on my side. Thanks for the hard and free labour..

@evanmiller
Contributor

Hi, please try the latest update and let me know if that fixes things for you. I've tried to make SPSS-compatible variable names, though without full name-conflict resolution.

47569c0

@rubenarslan
Author

👍 It looks good! I've only tried it with my test examples and with two variables called long and long2, but it works!

@evanmiller
Contributor

@rubenarslan Great, thanks for letting me know! If you find corner cases etc where the problem persists or the import doesn't work, please file a new issue. For now I will close.

One last question: What version of SPSS are you running? I have received scattered reports that SPSS 25 won't import from ReadStat - but haven't been able to confirm.

@rubenarslan
Author

rubenarslan commented Dec 18, 2018

Sorry, I just have v.20

@sjkiss

sjkiss commented Jul 4, 2022

I am having this issue: I am exporting a data frame with two variables that contain character strings, many of which are longer than 255 characters. But there is something weird. In an earlier project I exported those two variables to an Excel file for spell checking, and now I have to re-import them. I'm fine with doing the re-import and join. But when I then export to a .sav file, the original variables (i.e. those ending in .x) are exported as single variables, while the new, spell-checked variables (ending in .y) are split at 255 characters.

I feel like some of the people who have encountered this have come up with some hacks, but they are impenetrable to me.

library(haven)
library(tidyverse)
library(here)  # provides here(), used in the write_sav() call below

# Import the data file
on18 <- read_sav(file = "https://github.com/sjkiss/ON18/raw/master/Data/Ontario%20ES%202018%20LISPOP.sav")

# Show the string lengths of the variables of interest
on18 %>%
  select(immfeel, indivfinfeel) %>%
  map(nchar) %>%
  map(summary)

# Read in the spell-checked file
out <- read.csv(file = "https://github.com/sjkiss/ON18/raw/master/Data/feelings_workfile_spellcheck.csv")

# Merge the two; indivfinfeel.y and immfeel.y ARE SPELLCHECKED
on18 <- on18 %>%
  left_join(out, by = "id")

# Export the original (.x) and spell-checked (.y) variables
on18 %>%
  select(immfeel.x, immfeel.y, indivfinfeel.x, indivfinfeel.y) %>%
  write_sav(here("Data/test_from_github.sav"))

Screen Shot 2022-07-04 at 2 40 29 PM

R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] openxlsx_4.2.4  stringi_1.7.6   car_3.0-12      carData_3.0-4   labelled_2.9.0  here_1.0.1      haven_2.5.0    
 [8] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4     readr_2.1.1     tidyr_1.1.4     tibble_3.1.6   
[15] ggplot2_3.3.5   tidyverse_1.3.1 rio_0.5.29     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7        lubridate_1.8.0   lattice_0.20-45   rprojroot_2.0.2   assertthat_0.2.1  digest_0.6.29    
 [7] psych_2.1.9       utf8_1.2.2        R6_2.5.1          cellranger_1.1.0  backports_1.4.0   reprex_2.0.1     
[13] evaluate_0.14     httr_1.4.2        pillar_1.6.4      rlang_0.4.12      curl_4.3.2        readxl_1.3.1     
[19] rstudioapi_0.13   data.table_1.14.2 rmarkdown_2.11    foreign_0.8-81    munsell_0.5.0     broom_0.7.10     
[25] compiler_4.1.2    modelr_0.1.8      xfun_0.28         pkgconfig_2.0.3   mnormt_2.0.2      tmvnsim_1.0-2    
[31] htmltools_0.5.2   tidyselect_1.1.1  fansi_0.5.0       withr_2.4.3       crayon_1.4.2      tzdb_0.2.0       
[37] dbplyr_2.1.1      grid_4.1.2        nlme_3.1-153      jsonlite_1.7.2    gtable_0.3.0      lifecycle_1.0.1  
[43] DBI_1.1.1         magrittr_2.0.1    scales_1.1.1      zip_2.2.0         cli_3.3.0         fs_1.5.1         
[49] xml2_1.3.3        ellipsis_0.3.2    generics_0.1.1    vctrs_0.3.8       tools_4.1.2       glue_1.6.2       
[55] hms_1.1.1         abind_1.4-5       parallel_4.1.2    fastmap_1.1.0     yaml_2.2.1        colorspace_2.0-2 
[61] rvest_1.0.2       knitr_1.36     

@sjkiss

sjkiss commented Jul 5, 2022

I read more closely in Long string handling #118.

This:

Modify the variable name of long-string columns such that the name contains 5 characters or less. This is apparently to account for suffixes that readstat adds behind the scenes when it generates variable segments.

worked like a charm. I am sorted.
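
For anyone with many long-string columns, the renaming step can be automated. A minimal Python sketch of one way to build such a mapping; the function name `short_names` and the truncate-then-number scheme are purely illustrative and are not what haven or ReadStat does internally:

```python
def short_names(names, max_len=5):
    """Map each name to a unique replacement of at most `max_len` characters.

    Names that already fit are kept. Longer names are truncated and, on a
    clash, given a numeric tail. Comparison is case-insensitive, on the
    assumption that SPSS treats variable names that way.
    """
    used = {n.lower() for n in names if len(n) <= max_len}
    mapping = {}
    for name in names:
        if len(name) <= max_len:
            mapping[name] = name
            continue
        candidate = name[:max_len]
        i = 1
        while candidate.lower() in used:
            tail = str(i)
            candidate = name[:max_len - len(tail)] + tail
            i += 1
        used.add(candidate.lower())
        mapping[name] = candidate
    return mapping

print(short_names(["immfeel.y", "indivfinfeel.y", "id"]))
```

The resulting mapping would be applied to the column names before the data frame is written out.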

3 participants