remove 1.1MB (85%) of binary size by not including iostream #41

Merged
merged 2 commits into from
Nov 24, 2020

biojppm
Contributor

@biojppm biojppm commented Nov 24, 2020

Picking up on the discussion started in #23 about large binaries: the problem is definitely real. This is what I'm seeing with the size harness that I detailed in that discussion:

[jpmag@pc] 3039$ for i in * ; do echo "$i/`cat $i/bm/float/*fast_float_d*dat`" ; done
linux-x86_64-clangxx11.0-Debug/c4core-bm-readfloat-fast_float_d: {compile: 0.343s, file_size: 1447312B}
linux-x86_64-clangxx11.0-Release/c4core-bm-readfloat-fast_float_d: {compile: 0.353s, file_size: 1391776B}
linux-x86_64-gxx10.2-Debug/c4core-bm-readfloat-fast_float_d: {compile: 0.318s, file_size: 1451928B}
linux-x86_64-gxx10.2-Release/c4core-bm-readfloat-fast_float_d: {compile: 0.203s, file_size: 1391768B}
linux-x86-clangxx11.0-Debug/c4core-bm-readfloat-fast_float_d: {compile: 0.334s, file_size: 1462908B}
linux-x86-clangxx11.0-Release/c4core-bm-readfloat-fast_float_d: {compile: 0.328s, file_size: 1392740B}
linux-x86-gxx10.2-Debug/c4core-bm-readfloat-fast_float_d: {compile: 0.203s, file_size: 1457560B}
linux-x86-gxx10.2-Release/c4core-bm-readfloat-fast_float_d: {compile: 0.191s, file_size: 1392732B}

[jpmag@pc] 3040$ for i in * ; do echo "$i/`cat $i/bm/float/*fast_float_f*dat`" ; done
linux-x86_64-clangxx11.0-Debug/c4core-bm-readfloat-fast_float_f: {compile: 0.349s, file_size: 1447312B}
linux-x86_64-clangxx11.0-Release/c4core-bm-readfloat-fast_float_f: {compile: 0.361s, file_size: 1391776B}
linux-x86_64-gxx10.2-Debug/c4core-bm-readfloat-fast_float_f: {compile: 0.334s, file_size: 1451904B}
linux-x86_64-gxx10.2-Release/c4core-bm-readfloat-fast_float_f: {compile: 0.234s, file_size: 1391768B}
linux-x86-clangxx11.0-Debug/c4core-bm-readfloat-fast_float_f: {compile: 0.412s, file_size: 1462892B}
linux-x86-clangxx11.0-Release/c4core-bm-readfloat-fast_float_f: {compile: 0.365s, file_size: 1392740B}
linux-x86-gxx10.2-Debug/c4core-bm-readfloat-fast_float_f: {compile: 0.225s, file_size: 1457524B}
linux-x86-gxx10.2-Release/c4core-bm-readfloat-fast_float_f: {compile: 0.194s, file_size: 1392732B}

Notice that this happens with both g++ and clang++, for x86 and x86_64 and also for Debug and Release. Notice also that the baseline executable consisting of the while(fgets()) { fputs() } is rarely above 20KB.

[jpmag@pc] 3041$ for i in * ; do echo "$i/`cat $i/bm/float/*base*dat`" ; done
linux-x86_64-clangxx11.0-Debug/c4core-bm-readfloat-baseline: {compile: 0.173s, file_size: 20096B}
linux-x86_64-clangxx11.0-Release/c4core-bm-readfloat-baseline: {compile: 0.156s, file_size: 16800B}
linux-x86_64-gxx10.2-Debug/c4core-bm-readfloat-baseline: {compile: 0.150s, file_size: 21072B}
linux-x86_64-gxx10.2-Release/c4core-bm-readfloat-baseline: {compile: 0.122s, file_size: 16800B}
linux-x86-clangxx11.0-Debug/c4core-bm-readfloat-baseline: {compile: 0.196s, file_size: 19988B}
linux-x86-clangxx11.0-Release/c4core-bm-readfloat-baseline: {compile: 0.164s, file_size: 15640B}
linux-x86-gxx10.2-Debug/c4core-bm-readfloat-baseline: {compile: 0.124s, file_size: 19804B}
linux-x86-gxx10.2-Release/c4core-bm-readfloat-baseline: {compile: 0.127s, file_size: 15676B}

When you point out that the fast_float code is small, you are right. But there is an #include <iostream>, and that is usually reason enough to cause bloated binaries. It brings a mountain of code: 30K lines and 713K characters, together with exceptions, new()s, delete()s, etc:

[jpmag@pc] 3051$ echo "#include <iostream>" | g++ -E -x c++ - | wc -lc
  29998  712929

Let's look at the sizes for iostream:

[jpmag@pc] 3052$ for i in * ; do echo "$i/`cat $i/bm/float/*iostream_f*dat`" ; done
linux-x86_64-clangxx11.0-Debug/c4core-bm-readfloat-iostream_f: {compile: 0.343s, file_size: 1357672B}
linux-x86_64-clangxx11.0-Release/c4core-bm-readfloat-iostream_f: {compile: 0.328s, file_size: 1345232B}
linux-x86_64-gxx10.2-Debug/c4core-bm-readfloat-iostream_f: {compile: 0.307s, file_size: 1362576B}
linux-x86_64-gxx10.2-Release/c4core-bm-readfloat-iostream_f: {compile: 0.226s, file_size: 1345272B}
linux-x86-clangxx11.0-Debug/c4core-bm-readfloat-iostream_f: {compile: 0.424s, file_size: 1356560B}
linux-x86-clangxx11.0-Release/c4core-bm-readfloat-iostream_f: {compile: 0.355s, file_size: 1343316B}
linux-x86-gxx10.2-Debug/c4core-bm-readfloat-iostream_f: {compile: 0.299s, file_size: 1356252B}
linux-x86-gxx10.2-Release/c4core-bm-readfloat-iostream_f: {compile: 0.189s, file_size: 1347436B}

[jpmag@pc] 3053$ for i in * ; do echo "$i/`cat $i/bm/float/*iostream_d*dat`" ; done
linux-x86_64-clangxx11.0-Debug/c4core-bm-readfloat-iostream_d: {compile: 0.346s, file_size: 1357672B}
linux-x86_64-clangxx11.0-Release/c4core-bm-readfloat-iostream_d: {compile: 0.368s, file_size: 1345232B}
linux-x86_64-gxx10.2-Debug/c4core-bm-readfloat-iostream_d: {compile: 0.333s, file_size: 1362576B}
linux-x86_64-gxx10.2-Release/c4core-bm-readfloat-iostream_d: {compile: 0.220s, file_size: 1345272B}
linux-x86-clangxx11.0-Debug/c4core-bm-readfloat-iostream_d: {compile: 0.324s, file_size: 1356560B}
linux-x86-clangxx11.0-Release/c4core-bm-readfloat-iostream_d: {compile: 0.331s, file_size: 1343316B}
linux-x86-gxx10.2-Debug/c4core-bm-readfloat-iostream_d: {compile: 0.202s, file_size: 1356252B}
linux-x86-gxx10.2-Release/c4core-bm-readfloat-iostream_d: {compile: 0.208s, file_size: 1347436B}

Don't these sizes look suspiciously similar to fast_float above? Let's check:

[jpmag@pc] 3054$ bloaty -d segments,sections,symbols linux-x86_64-gxx10.2-Debug/bm/float/c4core-bm-readfloat-fast_float_d
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  47.9%   678Ki  67.7%   678Ki    LOAD #3 [RX]
   100.0%   678Ki 100.0%   678Ki    .text
      70.1%   475Ki  70.1%   475Ki    [1476 Others]
       5.2%  35.1Ki   5.2%  35.1Ki    std::num_get<>::_M_extract_int<>()
       3.5%  23.8Ki   3.5%  23.8Ki    [section .text]
       2.9%  20.0Ki   2.9%  20.0Ki    std::__cxx11::money_get<>::_M_extract<>()
       2.5%  16.9Ki   2.5%  16.9Ki    std::money_get<>::_M_extract<>()
       2.3%  15.6Ki   2.3%  15.6Ki    d_print_comp_inner
       1.3%  8.68Ki   1.3%  8.68Ki    std::__cxx11::money_put<>::_M_insert<>()
       1.1%  7.50Ki   1.1%  7.50Ki    std::num_get<>::_M_extract_float()
       1.1%  7.16Ki   1.1%  7.16Ki    std::money_put<>::_M_insert<>()
       1.0%  7.05Ki   1.0%  7.05Ki    std::__moneypunct_cache<>::_M_cache()
       1.0%  7.04Ki   1.0%  7.04Ki    _ZNKSt7__cxx118time_getIcSt19istreambuf_iteratorIcSt11char_traitsIcEEE21_M_extract_via_formatES4_S4_RSt8ios_baseRSt12_Ios_IostateP2tmPKc.localalias
       0.9%  5.99Ki   0.9%  5.99Ki    std::__facet_shims::__moneypunct_fill_cache<>()
       0.9%  5.87Ki   0.9%  5.87Ki    std::num_get<>::do_get()
       0.8%  5.68Ki   0.8%  5.68Ki    std::basic_fstream<>::basic_fstream()
       0.8%  5.55Ki   0.8%  5.55Ki    std::__cxx11::time_get<>::get()
       0.8%  5.45Ki   0.8%  5.45Ki    _ZNKSt7__cxx118time_getIwSt19istreambuf_iteratorIwSt11char_traitsIwEEE21_M_extract_via_formatES4_S4_RSt8ios_baseRSt12_Ios_IostateP2tmPKw.localalias
       0.8%  5.37Ki   0.8%  5.37Ki    std::__cxx11::moneypunct<>::_M_initialize_moneypunct()
       0.8%  5.37Ki   0.8%  5.37Ki    std::moneypunct<>::_M_initialize_moneypunct()
       0.8%  5.17Ki   0.8%  5.17Ki    std::time_get<>::get()
       0.7%  4.90Ki   0.7%  4.90Ki    std::num_put<>::_M_insert_int<>()
       0.7%  4.87Ki   0.7%  4.87Ki    _ZNKSt8time_getIcSt19istreambuf_iteratorIcSt11char_traitsIcEEE21_M_extract_via_formatES3_S3_RSt8ios_baseRSt12_Ios_IostateP2tmPKc.localalias
     0.0%      48   0.0%      48    .plt
     0.0%      32   0.0%      32    .plt.got
     0.0%      27   0.0%      27    .init
     0.0%      13   0.0%      13    .fini
     0.0%       8   0.0%       8    [LOAD #3 [RX]]
  29.9%   424Ki   0.0%       0    [Unmapped]
    58.7%   249Ki   NAN%       0    .strtab
      85.2%   212Ki   NAN%       0    [1867 Others]
       1.8%  4.38Ki   NAN%       0    std::__cxx11::basic_string<>::replace()
       1.3%  3.35Ki   NAN%       0    std::__cxx11::basic_string<>::basic_string()
       1.2%  2.99Ki   NAN%       0    std::use_facet<>()
       1.1%  2.64Ki   NAN%       0    std::has_facet<>()
       0.9%  2.30Ki   NAN%       0    std::num_get<>::do_get()
       0.9%  2.24Ki   NAN%       0    std::num_get<>::get()
       0.9%  2.21Ki   NAN%       0    [section .strtab]
       0.8%  1.94Ki   NAN%       0    std::__cxx11::basic_string<>::insert()
       0.6%  1.44Ki   NAN%       0    std::num_get<>::_M_extract_int<>()
       0.6%  1.38Ki   NAN%       0    std::__cxx11::basic_string<>::_M_construct<>()
       0.6%  1.38Ki   NAN%       0    std::operator<< <>()
       0.5%  1.36Ki   NAN%       0    std::num_put<>::do_put()
       0.5%  1.32Ki   NAN%       0    std::num_put<>::put()
       0.5%  1.27Ki   NAN%       0    std::operator>><>()
       0.5%  1.27Ki   NAN%       0    std::__cxx11::moneypunct<>::moneypunct()
       0.4%  1.12Ki   NAN%       0    std::basic_string<>::basic_string()
       0.4%  1.08Ki   NAN%       0    std::moneypunct<>::moneypunct()
       0.4%  1.07Ki   NAN%       0    std::__cxx11::moneypunct_byname<>::moneypunct_byname()
       0.4%  1.05Ki   NAN%       0    std::time_put_byname<>::time_put_byname()
       0.4%  1.05Ki   NAN%       0    std::__facet_shims::__moneypunct_fill_cache<>()
    28.6%   121Ki   NAN%       0    .symtab
      86.4%   104Ki   NAN%       0    [1872 Others]
       3.1%  3.75Ki   NAN%       0    [section .symtab]
       1.1%  1.29Ki   NAN%       0    (anonymous namespace)::get_global()::global
       1.0%  1.22Ki   NAN%       0    std::__cxx11::basic_string<>::basic_string()
       0.9%  1.03Ki   NAN%       0    std::__cxx11::basic_string<>::replace()
       0.9%  1.03Ki   NAN%       0    std::use_facet<>()
       0.8%     960   NAN%       0    std::has_facet<>()
       0.5%     624   NAN%       0    std::basic_string<>::basic_string()
       0.5%     624   NAN%       0    std::string::string()
       0.5%     576   NAN%       0    std::__cxx11::moneypunct<>::moneypunct()
       0.5%     576   NAN%       0    std::__facet_shims::(anonymous namespace)::moneypunct_shim<>
       0.5%     576   NAN%       0    std::__facet_shims::(anonymous namespace)::moneypunct_shim<>::~moneypunct_shim()
       0.5%     576   NAN%       0    std::moneypunct<>::moneypunct()
       0.4%     528   NAN%       0    std::__cxx11::basic_string<>::insert()
       0.4%     528   NAN%       0    std::num_get<>::do_get()
       0.4%     528   NAN%       0    std::num_get<>::get()
       0.4%     528   NAN%       0    std::operator<< <>()
       0.4%     480   NAN%       0    std::operator>><>()
       0.3%     408   NAN%       0    std::basic_istream<>::operator>>()
       0.3%     408   NAN%       0    std::basic_ostream<>::operator<<()
       0.3%     408   NAN%       0    std::istream::operator>>()
     6.6%  28.1Ki   NAN%       0    .debug_info

...... continues

A lot of entries suspiciously related to stream/string. So let's see what happens if we remove these:

modified   include/fast_float/decimal_to_binary.h
@@ -10,7 +10,6 @@
 #include <cstdio>
 #include <cstdlib>
 #include <cstring>
-#include <iostream>
 
 namespace fast_float {
 
modified   include/fast_float/float_common.h
@@ -363,8 +363,8 @@ constexpr int binary_format<float>::smallest_power_of_ten() {
 } // namespace fast_float
 
 // for convenience:
-#include <ostream>
-inline std::ostream &operator<<(std::ostream &out, const fast_float::decimal &d) {
+template<class OStream>
+inline OStream& operator<<(OStream &out, const fast_float::decimal &d) {
   out << "0.";
   for (size_t i = 0; i < d.num_digits; i++) {
     out << int32_t(d.digits[i]);
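
A minimal, self-contained sketch of why the templated operator no longer needs <ostream>: any type that provides operator<< for C strings and integers can act as the stream. Everything named here (char_sink, sink_append, decimal_sketch) is hypothetical and stands in for the real types; only the operator body mirrors the part of the diff shown above, which is cut off before the end of the function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical minimal sink: it only knows how to append C strings and ints.
struct char_sink {
    char buf[64];
    std::size_t len;
    char_sink() : buf(), len(0) {}
};

// Append a C string, truncating silently when the buffer is full.
inline void sink_append(char_sink &s, const char *str) {
    assert(s.len < sizeof(s.buf));
    std::size_t n = std::strlen(str);
    const std::size_t room = sizeof(s.buf) - 1 - s.len;
    if (n > room) n = room;
    std::memcpy(s.buf + s.len, str, n);
    s.len += n;
    s.buf[s.len] = '\0';
}

inline char_sink &operator<<(char_sink &s, const char *str) {
    sink_append(s, str);
    return s;
}

inline char_sink &operator<<(char_sink &s, std::int32_t v) {
    char tmp[16];
    std::snprintf(tmp, sizeof(tmp), "%d", static_cast<int>(v));
    sink_append(s, tmp);
    return s;
}

// Hypothetical stand-in for fast_float::decimal, reduced to the two
// members that the operator in the diff touches.
struct decimal_sketch {
    std::uint32_t num_digits;
    std::uint8_t digits[16];
};

// Generic over the stream type, so this header never names std::ostream.
template <class OStream>
inline OStream &operator<<(OStream &out, const decimal_sketch &d) {
    out << "0.";
    for (std::size_t i = 0; i < d.num_digits; i++) {
        out << std::int32_t(d.digits[i]);
    }
    return out;
}
```

Code that does want std::ostream can still use the operator by including <ostream> itself; the cost then lands only on the translation units that ask for it.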

... and as I expected the result is now this:

[jpmag@pc] 3055$ for i in * ; do echo "$i/`cat $i/bm/float/*fast_float_f*dat`" ; done                                               
linux-x86_64-clangxx11.0-Debug/c4core-bm-readfloat-fast_float_f: {compile: 0.164s, file_size: 203176B}
linux-x86_64-clangxx11.0-Release/c4core-bm-readfloat-fast_float_f: {compile: 0.141s, file_size: 149080B}
linux-x86_64-gxx10.2-Debug/c4core-bm-readfloat-fast_float_f: {compile: 0.103s, file_size: 211976B}
linux-x86_64-gxx10.2-Release/c4core-bm-readfloat-fast_float_f: {compile: 0.092s, file_size: 34488B}
linux-x86-clangxx11.0-Debug/c4core-bm-readfloat-fast_float_f: {compile: 0.128s, file_size: 219004B}
linux-x86-clangxx11.0-Release/c4core-bm-readfloat-fast_float_f: {compile: 0.131s, file_size: 150748B}
linux-x86-gxx10.2-Debug/c4core-bm-readfloat-fast_float_f: {compile: 0.225s, file_size: 1457524B}
linux-x86-gxx10.2-Release/c4core-bm-readfloat-fast_float_f: {compile: 0.093s, file_size: 37492B}

[jpmag@pc] 3056$ for i in * ; do echo "$i/`cat $i/bm/float/*fast_float_d*dat`" ; done 
linux-x86_64-clangxx11.0-Debug/c4core-bm-readfloat-fast_float_d: {compile: 0.171s, file_size: 203184B}
linux-x86_64-clangxx11.0-Release/c4core-bm-readfloat-fast_float_d: {compile: 0.129s, file_size: 149080B}
linux-x86_64-gxx10.2-Debug/c4core-bm-readfloat-fast_float_d: {compile: 0.103s, file_size: 212008B}
linux-x86_64-gxx10.2-Release/c4core-bm-readfloat-fast_float_d: {compile: 0.080s, file_size: 34488B}
linux-x86-clangxx11.0-Debug/c4core-bm-readfloat-fast_float_d: {compile: 0.143s, file_size: 219020B}
linux-x86-clangxx11.0-Release/c4core-bm-readfloat-fast_float_d: {compile: 0.133s, file_size: 150752B}
linux-x86-gxx10.2-Debug/c4core-bm-readfloat-fast_float_d: {compile: 0.203s, file_size: 1457560B}
linux-x86-gxx10.2-Release/c4core-bm-readfloat-fast_float_d: {compile: 0.094s, file_size: 37492B}
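
For reference, the arithmetic behind the "85%" in the title can be sketched as follows. The helper name and struct are hypothetical; the byte counts in the note below are copied from the x86_64 g++ 10.2 Debug fast_float_f lines of the listings before and after the change.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: how much smaller did a binary get?
struct size_delta {
    std::uint64_t saved_bytes;
    std::uint64_t saved_percent; // rounded down to a whole percent
};

inline size_delta reduction(std::uint64_t before, std::uint64_t after) {
    assert(before > 0 && after <= before);
    size_delta r;
    r.saved_bytes = before - after;
    r.saved_percent = r.saved_bytes * 100 / before;
    return r;
}
```

With before = 1451904 and after = 211976, this gives 1239928 saved bytes, which is 85% of the original size.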

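So the size went down from about 1.4MB to about 0.2MB in Debug, and to as little as 34KB for the g++ Release builds. (The x86 g++ Debug entries still show the old size and compile time, so they appear not to have been rebuilt.) The new clang sizes of 150KB to 220KB are still high, but we can take a look at that on a later occasion. Let's take a look at the new binary: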

[jpmag@pc] 3057$ bloaty -d segments,sections,symbols linux-x86_64-gxx10.2-Debug/bm/float/c4core-bm-readfloat-fast_float_d
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  39.6%  81.9Ki  63.4%  81.9Ki    LOAD #3 [RX]
    99.8%  81.8Ki  99.8%  81.8Ki    .text
      36.6%  30.0Ki  36.6%  30.0Ki    [216 Others]
      19.0%  15.6Ki  19.0%  15.6Ki    d_print_comp_inner
       5.9%  4.81Ki   5.9%  4.81Ki    fast_float::parse_long_mantissa<>()
       5.5%  4.52Ki   5.5%  4.52Ki    fast_float::from_chars<>()
       3.1%  2.55Ki   3.1%  2.55Ki    d_type
       2.8%  2.31Ki   2.8%  2.31Ki    d_print_mod
       2.6%  2.11Ki   2.6%  2.11Ki    execute_cfa_program
       2.4%  1.93Ki   2.4%  1.93Ki    search_object
       2.2%  1.80Ki   2.2%  1.80Ki    execute_stack_op
       2.1%  1.74Ki   2.1%  1.74Ki    d_expression_1
       2.0%  1.63Ki   2.0%  1.63Ki    d_name
       2.0%  1.60Ki   2.0%  1.60Ki    uw_frame_state_for
       1.8%  1.50Ki   1.8%  1.50Ki    [section .text]
       1.7%  1.43Ki   1.7%  1.43Ki    __gxx_personality_v0
       1.7%  1.42Ki   1.7%  1.42Ki    _Unwind_IteratePhdrCallback
       1.7%  1.40Ki   1.7%  1.40Ki    d_demangle_callback.constprop.0
       1.7%  1.38Ki   1.7%  1.38Ki    d_special_name
       1.5%  1.26Ki   1.5%  1.26Ki    d_unqualified_name
       1.3%  1.06Ki   1.3%  1.06Ki    uw_update_context_1
       1.1%     959   1.1%     959    _Unwind_RaiseException
       1.1%     944   1.1%     944    d_maybe_print_fold_expression
     0.1%      48   0.1%      48    .plt
     0.0%      27   0.0%      27    .init
     0.0%      24   0.0%      24    .plt.got
     0.0%      16   0.0%      16    [LOAD #3 [RX]]
     0.0%      13   0.0%      13    .fini
  36.7%  76.0Ki   0.0%       0    [Unmapped]
    36.2%  27.5Ki   NAN%       0    .debug_info
    14.4%  11.0Ki   NAN%       0    .strtab
      72.1%  7.92Ki   NAN%       0    [262 Others]
       6.1%     691   NAN%       0    [section .strtab]
       1.8%     207   NAN%       0    (anonymous namespace)::get_global()::global
       1.2%     140   NAN%       0    fast_float::(anonymous namespace)::number_of_digits_decimal_left_shift()::number_of_digits_decimal_left_shift_table_powers_of_5
       1.2%     139   NAN%       0    __cxxabiv1::__class_type_info::__do_upcast()
       1.2%     138   NAN%       0    __gnu_cxx::__concurrence_unlock_error::~__concurrence_unlock_error()
       1.2%     135   NAN%       0    __gnu_cxx::__concurrence_unlock_error
       1.2%     132   NAN%       0    __gnu_cxx::__concurrence_lock_error::~__concurrence_lock_error()
       1.1%     129   NAN%       0    __gnu_cxx::__concurrence_lock_error
       1.1%     128   NAN%       0    __cxxabiv1::__si_class_type_info::__do_dyncast()
       1.1%     128   NAN%       0    fast_float::(anonymous namespace)::number_of_digits_decimal_left_shift()::number_of_digits_decimal_left_shift_table
       1.1%     126   NAN%       0    __cxxabiv1::__si_class_type_info::~__si_class_type_info()
       1.1%     123   NAN%       0    __cxxabiv1::__foreign_exception::~__foreign_exception()
       1.1%     123   NAN%       0    __cxxabiv1::__si_class_type_info
       1.1%     122   NAN%       0    __libc_csu_init
       1.1%     120   NAN%       0    __cxxabiv1::__foreign_exception
       1.0%     117   NAN%       0    __cxxabiv1::__class_type_info::~__class_type_info()
       1.0%     114   NAN%       0    __cxxabiv1::__class_type_info
       1.0%     111   NAN%       0    __cxxabiv1::__forced_unwind::~__forced_unwind()
       1.0%     108   NAN%       0    __cxxabiv1::__forced_unwind
       1.0%     107   NAN%       0    __cxxabiv1::__class_type_info::__do_dyncast()
    13.2%  10.0Ki   NAN%       0    .symtab
      64.9%  6.49Ki   NAN%       0    [266 Others]
      17.8%  1.78Ki   NAN%       0    [section .symtab]
       3.5%     360   NAN%       0    (anonymous namespace)::get_global()::global
       1.2%     120   NAN%       0    __gnu_cxx::__verbose_terminate_handler()
       1.2%     120   NAN%       0    __libc_csu_init
       0.9%      96   NAN%       0    stdout@@GLIBC_2.2.5
       0.7%      72   NAN%       0    __cxxabiv1::__class_type_info
       0.7%      72   NAN%       0    __cxxabiv1::__class_type_info::~__class_type_info()
       0.7%      72   NAN%       0    __cxxabiv1::__forced_unwind
       0.7%      72   NAN%       0    __cxxabiv1::__forced_unwind::~__forced_unwind()
       0.7%      72   NAN%       0    __cxxabiv1::__foreign_exception
       0.7%      72   NAN%       0    __cxxabiv1::__foreign_exception::~__foreign_exception()
       0.7%      72   NAN%       0    __cxxabiv1::__si_class_type_info
       0.7%      72   NAN%       0    __cxxabiv1::__si_class_type_info::~__si_class_type_info()
       0.7%      72   NAN%       0    __gnu_cxx::__concurrence_lock_error
       0.7%      72   NAN%       0    __gnu_cxx::__concurrence_lock_error::~__concurrence_lock_error()
       0.7%      72   NAN%       0    __gnu_cxx::__concurrence_unlock_error
       0.7%      72   NAN%       0    __gnu_cxx::__concurrence_unlock_error::~__concurrence_unlock_error()
       0.7%      72   NAN%       0    std::bad_exception
       0.7%      72   NAN%       0    std::bad_exception::~bad_exception()
       0.7%      72   NAN%       0    std::exception
    12.5%  9.51Ki   NAN%       0    .debug_str

So that was it: streams were our culprit.

This is actually not a surprise; I've seen it before. But unfortunately, for most people this will likely come as a surprise, even if they have a faint idea of the cost of streams. Streams should have no place in code that is intended to be lean and fast. They are the exact opposite of that and deserve, to paraphrase the verdict on goto, the label "streams considered evil". The headers are heavy, the binaries are heavy, and the code is slow. They certainly do not follow C++'s mantra of not paying for what you don't use. Streams stand to C++ as slavery once did to society: they are widely used and they may seem an integral part of daily life, but they are evil, and with many people you run the risk of being taken for a lunatic if you point out how evil streams are. As with slavery, the status quo is very strong.

I will now stop the rant, collect myself and press the submit button :-)

@lemire
Member

lemire commented Nov 24, 2020

Ok. I actually think we should be able to do away with iostream entirely... this is a good idea.

@lemire
Member

lemire commented Nov 24, 2020

Don’t get so worked up! :-)

@biojppm
Contributor Author

biojppm commented Nov 24, 2020

Although I mean everything I wrote above, I sort of wrote that in jest -- trying to be expressive while concise. Maybe the balance went a bit too far to the expressive side :-)

Also, to make it clear, in no way was I directing this at any of fast_float's code. It is entirely about how something like iostream has such a prominent place in C++.

@lemire
Member

lemire commented Nov 24, 2020

@biojppm I also dislike streams in C++.

@lemire
Member

lemire commented Nov 24, 2020

Merging. I will issue a release.

@lemire lemire merged commit f51af51 into fastfloat:main Nov 24, 2020
@lemire
Member

lemire commented Nov 24, 2020

Now that this is merged, I must say that I don't really understand your analysis.

I agree that it is best not to include iostream, but I don't think that including iostream should ever result in megabytes of binary. This does not make sense to me.

No need to explain though because I am not arguing in the least against the PR.

@biojppm
Contributor Author

biojppm commented Nov 24, 2020

I'm happy to explain -- I think it's important to understand.

The gist of the analysis is this: compare different alternatives of reading a float/double with regard to compilation time and (more significantly) binary size.

This is accomplished by compiling, for each alternative function, a single main of the form:

int main()
{
    char buf[BUFSIZE];
    while(fgets(buf, BUFSIZE, stdin))
    {
        fputs(buf, stdout);
        READ_FROM_BUF(buf);
    }
}

For example, the fast_float read is compiled with

#include <c4/ext/fast_float.hpp>
#include <cstring>
double doit(const char *s)
{
    double result;
    fast_float::from_chars(s, s+strlen(s), result);
    return result;
}
#define READ_FROM_BUF(s) (void) doit(s)

whereas stringstream is compiled with

#include <sstream>
float doit(const char *s)
{
   std::stringstream ss;
   ss << s;
   float val;
   ss >> val;
   return val;
}
#define READ_FROM_BUF(s) (void) doit(s)

And of course the baseline executable with no float conversion is compiled with this:

#define READ_FROM_BUF(s)
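
A compact sketch of how such variants could be selected from a single translation unit. This is not the benchmark file itself: the SKETCH_USE_STRTOF switch, the echo_and_read helper and the std::strtof stand-in are assumptions made so the example compiles without fast_float.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical switch; it could be set on the compile line with
// -DSKETCH_USE_STRTOF=0 to build the baseline variant instead.
#ifndef SKETCH_USE_STRTOF
#define SKETCH_USE_STRTOF 1
#endif

#if SKETCH_USE_STRTOF
// Stand-in reader: std::strtof plays the role of the function under test.
inline float doit(const char *s)
{
    return std::strtof(s, nullptr);
}
#define READ_FROM_BUF(s) (void) doit(s)
#else
// Baseline: the macro expands to nothing, leaving only the echo loop.
#define READ_FROM_BUF(s)
#endif

// Same loop as the main() above, parameterized on the FILE handles.
// Returns the number of lines echoed.
inline int echo_and_read(std::FILE *in, std::FILE *out)
{
    assert(in && out);
    enum { BUFSIZE = 128 };
    char buf[BUFSIZE];
    int lines = 0;
    while(std::fgets(buf, BUFSIZE, in))
    {
        std::fputs(buf, out);
        READ_FROM_BUF(buf);
        ++lines;
    }
    return lines;
}
```

A main() for this sketch would simply call echo_and_read(stdin, stdout). Because every variant shares the same loop, any size difference between the resulting executables comes from the reader and whatever it drags in.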

You can see the entire file here. This is compiled with the appropriate preprocessor definitions: see the relevant cmake for that.
The resulting executable size is then reported using pre-build and post-build commands and stored into a file. These are the files that I'm catting above.
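
The records shown in the listings above have the shape {compile: ...s, file_size: ...B}. As a rough illustration only, a record like that could be produced as sketched below; the actual harness does this from build commands, and both function names here are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Size of a file in bytes, or -1 if it cannot be opened or measured.
inline long file_size_bytes(const char *path)
{
    std::FILE *f = std::fopen(path, "rb");
    if(!f)
        return -1;
    long size = -1;
    if(std::fseek(f, 0, SEEK_END) == 0)
        size = std::ftell(f);
    std::fclose(f);
    return size;
}

// Format one record in the same shape as the .dat lines in this thread.
inline std::string format_record(double compile_seconds, long size_bytes)
{
    assert(size_bytes >= 0);
    char buf[96];
    std::snprintf(buf, sizeof(buf), "{compile: %.3fs, file_size: %ldB}",
                  compile_seconds, size_bytes);
    return std::string(buf);
}
```

The compile time would have to be measured separately, for example by timestamping before and after the build step.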

Then I did this for all combinations of (g++,clang++) x (Release,Debug) x (x86,x86_64). It is a lot of work if done manually, but I have a tool to cleanly automate this. The results above are a cross-panel of each function across the different builds after they complete.

Finally, when diving deeper into the symbols present in the executable, I'm using Bloaty McBloatface.

Let me know if something in particular is unclear.

@lemire
Member

lemire commented Nov 24, 2020

"Let me know if something in particular is unclear."

I do not understand. Sorry.

I understand how including the header might increase the compilation time. So let us leave that aside.

I do not understand what you mean by binary bloat.

We have binary executables as part of the project, the test files. They do not use nearly 1 MB each. They use the library, evidently. They also use iostream. Look...

$ ls -alh
total 2456
drwxr-xr-x  19 lemire  staff   608B 24 Nov 09:11 .
drwxr-xr-x  10 lemire  staff   320B 23 Nov 18:25 ..
drwxr-xr-x  17 lemire  staff   544B 23 Nov 18:24 CMakeFiles
-rw-r--r--   1 lemire  staff   4.5K 23 Nov 18:24 CTestTestfile.cmake
-rw-r--r--   1 lemire  staff    28K 23 Nov 18:24 Makefile
-rwxr-xr-x   1 lemire  staff   588K 24 Nov 09:11 basictest
-rw-r--r--   1 lemire  staff   1.3K 23 Nov 18:24 cmake_install.cmake
-rwxr-xr-x   1 lemire  staff    35K 24 Nov 09:10 example_test
-rwxr-xr-x   1 lemire  staff    38K 24 Nov 09:09 exhaustive32
-rwxr-xr-x   1 lemire  staff    39K 24 Nov 09:10 exhaustive32_64
-rwxr-xr-x   1 lemire  staff    40K 24 Nov 09:10 exhaustive32_midpoint
-rwxr-xr-x   1 lemire  staff    38K 24 Nov 09:10 long_exhaustive32
-rwxr-xr-x   1 lemire  staff    38K 24 Nov 09:10 long_exhaustive32_64
-rwxr-xr-x   1 lemire  staff    39K 24 Nov 09:11 long_random64
-rwxr-xr-x   1 lemire  staff    45K 24 Nov 09:08 powersoffive_hardround
-rwxr-xr-x   1 lemire  staff    39K 24 Nov 09:11 random64
-rwxr-xr-x   1 lemire  staff    44K 24 Nov 09:10 random_string
-rwxr-xr-x   1 lemire  staff    44K 24 Nov 09:09 short_random_string
-rwxr-xr-x   1 lemire  staff   146K 24 Nov 09:09 string_test

So I don't understand what you are measuring.

@biojppm
Contributor Author

biojppm commented Nov 24, 2020

Which compiler+version is that?

@biojppm biojppm deleted the fix/bloated_binary branch November 24, 2020 21:51
@lemire
Member

lemire commented Nov 24, 2020

Can you just do this for me...

cmake -B build
cmake --build build
ls -alh build/tests/*

(Adjust accordingly if you are under Visual Studio.)

@biojppm
Contributor Author

biojppm commented Nov 24, 2020

Indeed, those are the sizes I'm seeing as well.

@biojppm
Contributor Author

biojppm commented Nov 24, 2020

[24/11/20 22:08:32]--(jobs:1)--(~/proj/fast_float) (fast_float/(HEAD detached at caade69))
[jpmag@mozart] 3032$ ll build/linux-x86_64-gxx*/tests/
build/linux-x86_64-gxx10.2-Debug/tests/:
total 3.9M
2.0M -rwxr-xr-x  1 jpmag jpmag 2.0M Nov 24 22:02 basictest*
4.0K drwxr-xr-x 15 jpmag jpmag 4.0K Nov 24 22:07 CMakeFiles/
4.0K -rw-r--r--  1 jpmag jpmag 1.5K Nov 24 21:58 cmake_install.cmake
8.0K -rw-r--r--  1 jpmag jpmag 4.2K Nov 24 21:58 CTestTestfile.cmake
128K -rwxr-xr-x  1 jpmag jpmag 126K Nov 24 22:02 example_test*
100K -rwxr-xr-x  1 jpmag jpmag  99K Nov 24 22:02 exhaustive32*
140K -rwxr-xr-x  1 jpmag jpmag 137K Nov 24 22:02 exhaustive32_64*
108K -rwxr-xr-x  1 jpmag jpmag 106K Nov 24 22:02 exhaustive32_midpoint*
100K -rwxr-xr-x  1 jpmag jpmag  99K Nov 24 22:02 long_exhaustive32*
100K -rwxr-xr-x  1 jpmag jpmag  99K Nov 24 22:02 long_exhaustive32_64*
104K -rwxr-xr-x  1 jpmag jpmag 104K Nov 24 22:02 long_random64*
 32K -rw-r--r--  1 jpmag jpmag  31K Nov 24 21:58 Makefile
168K -rwxr-xr-x  1 jpmag jpmag 165K Nov 24 22:02 powersoffive_hardround*
104K -rwxr-xr-x  1 jpmag jpmag 104K Nov 24 22:02 random64*
148K -rwxr-xr-x  1 jpmag jpmag 145K Nov 24 22:02 random_string*
148K -rwxr-xr-x  1 jpmag jpmag 145K Nov 24 22:02 short_random_string*
460K -rwxr-xr-x  1 jpmag jpmag 457K Nov 24 22:02 string_test*

build/linux-x86_64-gxx10.2-Release/tests/:
total 1.2M
584K -rwxr-xr-x  1 jpmag jpmag 582K Nov 24 22:02 basictest*
4.0K drwxr-xr-x 15 jpmag jpmag 4.0K Nov 24 22:07 CMakeFiles/
4.0K -rw-r--r--  1 jpmag jpmag 1.5K Nov 24 21:58 cmake_install.cmake
8.0K -rw-r--r--  1 jpmag jpmag 4.2K Nov 24 21:58 CTestTestfile.cmake
 36K -rwxr-xr-x  1 jpmag jpmag  35K Nov 24 22:02 example_test*
 36K -rwxr-xr-x  1 jpmag jpmag  35K Nov 24 22:02 exhaustive32*
 40K -rwxr-xr-x  1 jpmag jpmag  40K Nov 24 22:02 exhaustive32_64*
 40K -rwxr-xr-x  1 jpmag jpmag  40K Nov 24 22:02 exhaustive32_midpoint*
 40K -rwxr-xr-x  1 jpmag jpmag  39K Nov 24 22:02 long_exhaustive32*
 40K -rwxr-xr-x  1 jpmag jpmag  39K Nov 24 22:02 long_exhaustive32_64*
 40K -rwxr-xr-x  1 jpmag jpmag  39K Nov 24 22:02 long_random64*
 32K -rw-r--r--  1 jpmag jpmag  31K Nov 24 21:58 Makefile
 48K -rwxr-xr-x  1 jpmag jpmag  46K Nov 24 22:02 powersoffive_hardround*
 40K -rwxr-xr-x  1 jpmag jpmag  39K Nov 24 22:02 random64*
 44K -rwxr-xr-x  1 jpmag jpmag  44K Nov 24 22:02 random_string*
 44K -rwxr-xr-x  1 jpmag jpmag  44K Nov 24 22:02 short_random_string*
140K -rwxr-xr-x  1 jpmag jpmag 138K Nov 24 22:02 string_test*

@biojppm
Contributor Author

biojppm commented Nov 24, 2020

and similar for clang

@lemire
Member

lemire commented Nov 24, 2020

Right. So of course, the debug builds are fat, but that's fine.

Now you might think "35KB is a lot" but fast_float is not itself responsible for all of the 35KB. Only maybe half.

(Let us be clear: it is still a good idea to remove unneeded headers.)

@biojppm
Copy link
Contributor Author

biojppm commented Nov 24, 2020

Yes, the sizes are reasonable. I even compiled the equivalent of my test inside the fast_float project, and the results are still small:

Head:     caade69 Merge pull request #28 from lemire/dlemire/aqrit_magic
Tags:     v0.2.0 (44), v0.3.0 (3)

Staged changes (4)
modified   tests/CMakeLists.txt
@@ -40,3 +40,7 @@ fast_float_add_cpp_test(long_random64)
 fast_float_add_cpp_test(random64)
 fast_float_add_cpp_test(basictest)
 fast_float_add_cpp_test(example_test)
+
+fast_float_add_cpp_test(bloat_baseline)
+fast_float_add_cpp_test(bloat_iostream)
+fast_float_add_cpp_test(bloat_fastfloat)
new file   tests/bloat_baseline.cpp
@@ -0,0 +1,12 @@
+#include <cstdio>
+
+int main()
+{
+    #define BUFSIZE 128
+    char buf[BUFSIZE];
+    while(fgets(buf, BUFSIZE, stdin))
+    {
+        fputs(buf, stdout);
+        (void) 0;
+    }
+}
new file   tests/bloat_fastfloat.cpp
@@ -0,0 +1,21 @@
+#include <cstdio>
+#include <cstring>
+#include <fast_float/fast_float.h>
+
+float doit(const char *s)
+{
+    float result;
+    fast_float::from_chars(s, s+strlen(s), result);
+    return result;
+}
+
+int main()
+{
+    #define BUFSIZE 128
+    char buf[BUFSIZE];
+    while(fgets(buf, BUFSIZE, stdin))
+    {
+        fputs(buf, stdout);
+        (void) doit(buf);
+    }
+}
new file   tests/bloat_iostream.cpp
@@ -0,0 +1,23 @@
+#include <cstdio>
+#include <sstream>
+
+
+float doit(const char *s)
+{
+   std::stringstream ss;
+   ss << s;
+   float val;
+   ss >> val;
+   return val;
+}
+
+int main()
+{
+    #define BUFSIZE 128
+    char buf[BUFSIZE];
+    while(fgets(buf, BUFSIZE, stdin))
+    {
+        fputs(buf, stdout);
+        (void) doit(buf);
+    }
+}

resulting in this:

[24/11/20 22:19:26]--(jobs:2)--(~/proj/fast_float) (fast_float/(HEAD detached at caade69))
[jpmag@mozart] 3037$ ll build/linux-x86_64-*/tests/bloat*
20K -rwxr-xr-x 1 jpmag jpmag 20K Nov 24 22:19 build/linux-x86_64-clangxx11.0-Debug/tests/bloat_baseline*
92K -rwxr-xr-x 1 jpmag jpmag 90K Nov 24 22:19 build/linux-x86_64-clangxx11.0-Debug/tests/bloat_fastfloat*
32K -rwxr-xr-x 1 jpmag jpmag 30K Nov 24 22:19 build/linux-x86_64-clangxx11.0-Debug/tests/bloat_iostream*
20K -rwxr-xr-x 1 jpmag jpmag 17K Nov 24 22:19 build/linux-x86_64-clangxx11.0-Release/tests/bloat_baseline*
36K -rwxr-xr-x 1 jpmag jpmag 35K Nov 24 22:19 build/linux-x86_64-clangxx11.0-Release/tests/bloat_fastfloat*
20K -rwxr-xr-x 1 jpmag jpmag 18K Nov 24 22:19 build/linux-x86_64-clangxx11.0-Release/tests/bloat_iostream*
24K -rwxr-xr-x 1 jpmag jpmag 21K Nov 24 22:19 build/linux-x86_64-gxx10.2-Debug/tests/bloat_baseline*
96K -rwxr-xr-x 1 jpmag jpmag 94K Nov 24 22:19 build/linux-x86_64-gxx10.2-Debug/tests/bloat_fastfloat*
36K -rwxr-xr-x 1 jpmag jpmag 35K Nov 24 22:19 build/linux-x86_64-gxx10.2-Debug/tests/bloat_iostream*
20K -rwxr-xr-x 1 jpmag jpmag 17K Nov 24 22:19 build/linux-x86_64-gxx10.2-Release/tests/bloat_baseline*
36K -rwxr-xr-x 1 jpmag jpmag 34K Nov 24 22:19 build/linux-x86_64-gxx10.2-Release/tests/bloat_fastfloat*
20K -rwxr-xr-x 1 jpmag jpmag 19K Nov 24 22:19 build/linux-x86_64-gxx10.2-Release/tests/bloat_iostream*

@lemire
Member

lemire commented Nov 25, 2020

Ok.

In any case, your PR was good no matter what.

@biojppm
Contributor Author

biojppm commented Nov 25, 2020

I double-checked: the bloated sizes I reported previously are definitely correct, and they absolutely go away when I remove the include of iostream. How do we reconcile this?

Given that the input code is actually the same but the sizes differ, we must be led to think it comes down to how the files are compiled. Indeed the compilation lines differ between the projects, and the main difference is dynamic vs static linking. But I am getting ahead of myself - let's watch the movie without spoilers.

Here's first the complete line from inside fast_float:

# output stripped for clarity
[24/11/20 22:35:37]--(jobs:2)--(~/proj/fast_float) (fast_float/(HEAD detached at caade69))
[jpmag@mozart] 3041$ ( cd build/linux-x86_64-gxx10.2-Debug/tests ; make VERBOSE=1 -B bloat_fastfloat && echo && ll CMakeFiles/bloat_fastfloat.dir/*.o bloat_fastfloat )
[ 50%] Building CXX object tests/CMakeFiles/bloat_baseline.dir/bloat_baseline.cpp.o
/usr/bin/g++  -I/opt/jpmag/proj/fast_float/include -I/opt/jpmag/proj/fast_float/build/linux-x86_64-gxx10.2-Debug/_deps/doctest-src -m64 -g -Werror -Wall -Wextra -Weffc++ -Wsign-compare -Wshadow -Wwrite-strings -Wpointer-arith -Winit-self -Wconversion -Wsign-conversion -std=gnu++11 -o CMakeFiles/bloat_baseline.dir/bloat_baseline.cpp.o -c /opt/jpmag/proj/fast_float/tests/bloat_baseline.cpp
[100%] Linking CXX executable bloat_baseline
/usr/bin/g++     -m64 -g CMakeFiles/bloat_baseline.dir/bloat_baseline.cpp.o -o bloat_baseline

 96K -rwxr-xr-x 1 jpmag jpmag 94K Nov 24 22:58 bloat_fastfloat*
100K -rw-r--r-- 1 jpmag jpmag 97K Nov 24 22:58 CMakeFiles/bloat_fastfloat.dir/bloat_fastfloat.cpp.o

Now from my project (using the same fast_float commit as above, pre-merge, caade69):

# output stripped for clarity
[24/11/20 23:01:35]--(jobs:0)--(/opt/jpmag/proj/c4core) (c4core.git/master)
[jpmag@mozart] 3018$ ( cd build/linux-x86_64-gxx10.2-Debug/bm/float ; make VERBOSE=1 -B c4core-bm-readfloat-fast_float_f && echo && ll CMakeFiles/*fast_float_f.dir/*o c4core-bm-readfloat-fast_float_f )
/usr/bin/g++ -DC4FLOAT_FASTFLOAT_F=1 -I/opt/jpmag/proj/c4core/src -m64 -g -Werror -pedantic-errors -fstrict-aliasing -Wall -Wextra -pedantic -Wshadow -Wnon-virtual-dtor -Wcast-align -Wunused -Woverloaded-virtual -Wpedantic -Wconversion -Wsign-conversion -Wdouble-promotion -Wfloat-equal -Wformat=2 -Wlogical-op -Wuseless-cast -std=c++11 -o CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -c /opt/jpmag/proj/c4core/bm/float/read.cpp
[100%] Linking CXX executable c4core-bm-readfloat-fast_float_f
/usr/bin/g++     -m64 -g CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -o c4core-bm-readfloat-fast_float_f  -static-libgcc -static-libstdc++ 

1.4M -rwxr-xr-x 1 jpmag jpmag 1.4M Nov 24 23:01 c4core-bm-readfloat-fast_float_f*
100K -rw-r--r-- 1 jpmag jpmag  97K Nov 24 23:01 CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o

Let me paste both again, but stripping out flags for warnings and includes:

#fast_float
/usr/bin/g++  -m64 -g -std=gnu++11 -o .../bloat_baseline.cpp.o -c ..../bloat_baseline.cpp
/usr/bin/g++     -m64 -g .../bloat_baseline.cpp.o -o bloat_baseline
 96K -rwxr-xr-x 1 jpmag jpmag 94K Nov 24 22:58 bloat_fastfloat*
100K -rw-r--r-- 1 jpmag jpmag 97K Nov 24 22:58 CMakeFiles/bloat_fastfloat.dir/bloat_fastfloat.cpp.o

#myproj
/usr/bin/g++ -m64 -g -fstrict-aliasing -std=c++11 -o .../read.cpp.o -c .../read.cpp
/usr/bin/g++     -m64 -g .../read.cpp.o -o c4core-bm-readfloat-fast_float_f  -static-libgcc -static-libstdc++
1.4M -rwxr-xr-x 1 jpmag jpmag 1.4M Nov 24 23:01 c4core-bm-readfloat-fast_float_f*
100K -rw-r--r-- 1 jpmag jpmag  97K Nov 24 23:01 CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o

In the compile line the only relevant difference is that your project uses compiler extensions (gnu++11). That should not cause a dramatic difference, though. The object file size is exactly the same (97K), as it should be.

However, the final result is dramatically different, and I am convinced the difference comes down to the link step. Notice that I am requesting a static link of the standard library through -static-libgcc -static-libstdc++. Here's the result if I remove these flags:

[24/11/20 23:34:44]--(jobs:0)--(/opt/jpmag/proj/c4core/build/linux-x86_64-gxx10.2-Debug/bm/float)
[jpmag@mozart] 3034$ (set -x ; exe=c4core-bm-readfloat-fast_float_f ; /usr/bin/g++     -m64 -g CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -o $exe ; ll $exe CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o ; ldd $exe ; readelf -s $exe )
+ exe=c4core-bm-readfloat-fast_float_f
+ /usr/bin/g++ -m64 -g CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -o c4core-bm-readfloat-fast_float_f
+ ls --color=auto --color=auto -lFhs c4core-bm-readfloat-fast_float_f CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o
 96K -rwxr-xr-x 1 jpmag jpmag 94K Nov 24 23:35 c4core-bm-readfloat-fast_float_f*
100K -rw-r--r-- 1 jpmag jpmag 97K Nov 24 23:30 CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o
+ ldd c4core-bm-readfloat-fast_float_f
        linux-vdso.so.1 (0x00007ffe738e5000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007fbbc39e3000)
        libm.so.6 => /usr/lib/libm.so.6 (0x00007fbbc389d000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007fbbc3883000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007fbbc36ba000)
        /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007fbbc3c2a000)
+ readelf -s c4core-bm-readfloat-fast_float_f

Symbol table '.dynsym' contains 17 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FUNC    WEAK   DEFAULT  UND [...]@GLIBC_2.2.5 (2)
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
     3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
     4: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
     5: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __[...]@GLIBC_2.4 (3)
     6: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND fputs@GLIBC_2.2.5 (2)
     7: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBCXX_3.4 (4)
     8: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND fgets@GLIBC_2.2.5 (2)
     9: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND _[...]@CXXABI_1.3 (5)
    10: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_deregisterT[...]
    11: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    12: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
    13: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_registerTMC[...]
    14: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBCXX_3.4 (4)
    15: 000000000000a070     8 OBJECT  GLOBAL DEFAULT   25 [...]@GLIBC_2.2.5 (2)
    16: 000000000000a080     8 OBJECT  GLOBAL DEFAULT   25 stdin@GLIBC_2.2.5 (2)

Notice that now the executable size is exactly the same as yours! When we link dynamically, the file size leaves out all the standard library code that the executable pulls in from shared libraries at run time. That's why I started my analysis with static flags.
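One way to make that hidden cost visible is to list the on-disk size of each shared library the executable depends on. This is only a minimal sketch, assuming GNU coreutils `stat` and an `ldd` output format like the one above; the helper name `ldd_paths` is made up for illustration.

```shell
#!/bin/sh
# ldd_paths (hypothetical helper): read `ldd` output on stdin and print
# the resolved path of every shared library, i.e. the field after "=>".
# Entries without a resolved path (such as linux-vdso) are skipped.
ldd_paths() {
    awk '$2 == "=>" && $3 ~ /^\// { print $3 }'
}

# Only when an executable is passed as the first argument:
# print the size in bytes of each library it loads at run time.
if [ -n "${1:-}" ] && [ -x "$1" ]; then
    ldd "$1" | ldd_paths | while read -r lib; do
        # -L follows symlinks so the real library file is measured
        stat -L -c '%s %n' "$lib"
    done
fi
```

Saved to a file and run with the dynamically linked executable as its argument, this should print one line per library. The point is just that those bytes exist somewhere on the system even though they do not show up in the executable's own size.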

Now let's add -static-libgcc -static-libstdc++:

[24/11/20 23:35:03]--(jobs:0)--(/opt/jpmag/proj/c4core/build/linux-x86_64-gxx10.2-Debug/bm/float)
[jpmag@mozart] 3035$ (set -x ; exe=c4core-bm-readfloat-fast_float_f ; /usr/bin/g++     -m64 -g CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -o $exe -static-libgcc -static-libstdc++ ; ll $exe CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o ; ldd $exe ; readelf -s $exe )
+ exe=c4core-bm-readfloat-fast_float_f
+ /usr/bin/g++ -m64 -g CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -o c4core-bm-readfloat-fast_float_f -static-libgcc -static-libstdc++
+ ls --color=auto --color=auto -lFhs c4core-bm-readfloat-fast_float_f CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o
1.4M -rwxr-xr-x 1 jpmag jpmag 1.4M Nov 24 23:36 c4core-bm-readfloat-fast_float_f*
100K -rw-r--r-- 1 jpmag jpmag  97K Nov 24 23:30 CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o
+ ldd c4core-bm-readfloat-fast_float_f
        linux-vdso.so.1 (0x00007ffcee192000)
        libm.so.6 => /usr/lib/libm.so.6 (0x00007efd06888000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007efd066bf000)
        /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007efd06b2b000)
+ readelf -s c4core-bm-readfloat-fast_float_f

Symbol table '.dynsym' contains 111 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_addUserComm[...]
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
     3: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_memcpyRtWn
     4: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
....
    12: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    13: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND close@GLIBC_2.2.5 (2)
    14: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    15: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND ioctl@GLIBC_2.2.5 (2)
    16: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND abort@GLIBC_2.2.5 (2)
    17: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    18: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
....
    27: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND read@GLIBC_2.2.5 (2)
    28: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
...
    34: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND fgets@GLIBC_2.2.5 (2)
...
    37: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND pthread_once
    38: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    39: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    40: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_deregisterT[...]
    41: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ZGTtdlPv
    42: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    43: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    44: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND fputc@GLIBC_2.2.5 (2)
    45: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND free@GLIBC_2.2.5 (2)
    46: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_registerTMC[...]
    47: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    48: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    49: 0000000000000000     0 FUNC    WEAK   DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    50: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND wctob@GLIBC_2.2.5 (2)
    51: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __[...]@GLIBC_2.3 (3)
...
    57: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND iconv@GLIBC_2.2.5 (2)
    58: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_RU8
    59: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    60: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND poll@GLIBC_2.2.5 (2)
...
    64: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND putwc@GLIBC_2.2.5 (2)
    65: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND putc@GLIBC_2.2.5 (2)
    66: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    67: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    68: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND fread@GLIBC_2.2.5 (2)
    69: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
    70: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_memcpyRnWt
    71: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
...
    77: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __pthread_key_create
    78: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND getwc@GLIBC_2.2.5 (2)
...
   106: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND fputs@GLIBC_2.2.5 (2)
   107: 00000000000fa1c8     8 OBJECT  GLOBAL DEFAULT   29 [...]@GLIBC_2.2.5 (2)
   108: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND [...]@GLIBC_2.2.5 (2)
   109: 00000000000fa1c0     8 OBJECT  GLOBAL DEFAULT   29 stdin@GLIBC_2.2.5 (2)
   110: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __[...]@GLIBC_2.4 (5)

So indeed the executable is 1.4MB. But it is still bringing in dynamic symbols. That's because I forgot the -static flag. So let's add it on top:

[24/11/20 23:36:05]--(jobs:0)--(/opt/jpmag/proj/c4core/build/linux-x86_64-gxx10.2-Debug/bm/float)
[jpmag@mozart] 3036$ (set -x ; exe=c4core-bm-readfloat-fast_float_f ; /usr/bin/g++     -m64 -g CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -o $exe -static -static-libgcc -static-libstdc++ ; ll $exe CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o ; ldd $exe ; readelf -s $exe )
+ exe=c4core-bm-readfloat-fast_float_f
+ /usr/bin/g++ -m64 -g CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -o c4core-bm-readfloat-fast_float_f -static -static-libgcc -static-libstdc++
+ ls --color=auto --color=auto -lFhs c4core-bm-readfloat-fast_float_f CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o
2.3M -rwxr-xr-x 1 jpmag jpmag 2.3M Nov 24 23:37 c4core-bm-readfloat-fast_float_f*
100K -rw-r--r-- 1 jpmag jpmag  97K Nov 24 23:30 CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o
+ ldd c4core-bm-readfloat-fast_float_f
        not a dynamic executable
+ readelf -s c4core-bm-readfloat-fast_float_f
# .dynsym is empty

Notice that the executable no longer uses any dynamic symbols. So the executable size with all the required code in it is actually 2.3MB, even worse than the initial 1.4MB I was reporting.

The static vs dynamic question is the origin of this confusion. Dynamic sizes can be misleading.
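To repeat the comparison on another object file, the three link modes used above can be wrapped in a small script. This is a sketch under assumptions: the helper name `link_flags` and the output names `sizecheck-*` are made up for illustration, and `stat -c` assumes GNU coreutils.

```shell
#!/bin/sh
# link_flags (hypothetical helper): map a link mode to the extra g++ flags
# used in the experiments above.
link_flags() {
    case "$1" in
        dynamic) echo "" ;;
        partial) echo "-static-libgcc -static-libstdc++" ;;
        full)    echo "-static -static-libgcc -static-libstdc++" ;;
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# When an object file is passed as the first argument, link it in all three
# modes and print the resulting executable size in bytes.
if [ -n "${1:-}" ] && [ -f "$1" ]; then
    for mode in dynamic partial full; do
        exe="sizecheck-$mode"   # output name chosen for this sketch
        # $(link_flags ...) is left unquoted on purpose so the flags split into words
        g++ -m64 -g "$1" -o "$exe" $(link_flags "$mode") &&
            printf '%-8s %s bytes\n' "$mode" "$(stat -c %s "$exe")"
    done
fi
```

Only the `full` mode gives a number that accounts for all the code the program needs; the other two leave some or all of the runtime libraries out of the file.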

One final point: the commit under analysis in this post still has the include. Let's try now with the tip of this MR:

First, dynamic (no static flags):

[24/11/20 23:57:33]--(jobs:0)--(/opt/jpmag/proj/c4core/build/linux-x86_64-gxx10.2-Debug/bm/float)
[jpmag@mozart] 3041$ (set -x ; exe=c4core-bm-readfloat-fast_float_f ; /usr/bin/g++     -m64 -g CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -o $exe ; ll $exe CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o )
+ exe=c4core-bm-readfloat-fast_float_f
+ /usr/bin/g++ -m64 -g CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -o c4core-bm-readfloat-fast_float_f
+ ls --color=auto --color=auto -lFhs c4core-bm-readfloat-fast_float_f CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o
96K -rwxr-xr-x 1 jpmag jpmag 93K Nov 24 23:57 c4core-bm-readfloat-fast_float_f*
96K -rw-r--r-- 1 jpmag jpmag 95K Nov 24 23:56 CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o

So the dynamic executable without the include is essentially the same size as with it (93KB vs 94KB). Removing the header has no visible effect here.

Now let's add -static-libgcc -static-libstdc++:

[24/11/20 23:57:30]--(jobs:0)--(/opt/jpmag/proj/c4core/build/linux-x86_64-gxx10.2-Debug/bm/float)
[jpmag@mozart] 3040$ (set -x ; exe=c4core-bm-readfloat-fast_float_f ; /usr/bin/g++     -m64 -g CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -o $exe -static-libgcc -static-libstdc++ ; ll $exe CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o )
+ exe=c4core-bm-readfloat-fast_float_f
+ /usr/bin/g++ -m64 -g CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -o c4core-bm-readfloat-fast_float_f -static-libgcc -static-libstdc++
+ ls --color=auto --color=auto -lFhs c4core-bm-readfloat-fast_float_f CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o
212K -rwxr-xr-x 1 jpmag jpmag 209K Nov 24 23:57 c4core-bm-readfloat-fast_float_f*
 96K -rw-r--r-- 1 jpmag jpmag  95K Nov 24 23:56 CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o

Now the executable went from 1.4MB with include to 209KB without include, as reported at the beginning of this MR.

Finally, full static with -static -static-libgcc -static-libstdc++:

[24/11/20 23:57:08]--(jobs:0)--(/opt/jpmag/proj/c4core/build/linux-x86_64-gxx10.2-Debug/bm/float)
[jpmag@mozart] 3039$ (set -x ; exe=c4core-bm-readfloat-fast_float_f ; /usr/bin/g++     -m64 -g CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -o $exe -static -static-libgcc -static-libstdc++ ; ll $exe CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o )
+ exe=c4core-bm-readfloat-fast_float_f
+ /usr/bin/g++ -m64 -g CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o -o c4core-bm-readfloat-fast_float_f -static -static-libgcc -static-libstdc++
+ ls --color=auto --color=auto -lFhs c4core-bm-readfloat-fast_float_f CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o
936K -rwxr-xr-x 1 jpmag jpmag 934K Nov 24 23:57 c4core-bm-readfloat-fast_float_f*
 96K -rw-r--r-- 1 jpmag jpmag  95K Nov 24 23:56 CMakeFiles/c4core-bm-readfloat-fast_float_f.dir/read.cpp.o

So the final all-in static executable went from 2.3MB with the include to 934KB without it - a difference of roughly 1.4MB, in line with the ~1.2MB saved in the partially static case above. Taken together, the two sets of flags show that the include of iostream was costing more than 1MB of binary size.
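The savings can be checked with plain integer arithmetic. In this sketch the helper names are made up; 1451904 is the byte count the size harness reported for this target at the top of this PR, while the "without" size is only known rounded to 209K from `ls -h`, so 209 * 1024 is an approximation.

```shell
#!/bin/sh
# saved_bytes / saved_percent (hypothetical helpers): integer arithmetic on
# two file sizes in bytes, "with include" first and "without include" second.
saved_bytes()   { echo $(( $1 - $2 )); }
saved_percent() { echo $(( ($1 - $2) * 100 / $1 )); }

# Partially static case (-static-libgcc -static-libstdc++).
with=1451904                 # bytes, from the size harness output
without=$(( 209 * 1024 ))    # bytes, approximated from the rounded 209K
echo "saved: $(saved_bytes "$with" "$without") bytes ($(saved_percent "$with" "$without")%)"
```

With these inputs the script prints `saved: 1237888 bytes (85%)`, which is in the same range as the figure in the title.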

For a moment there I was also confused, but I think this explains the differences we were observing. I definitely learned something while trying to figure this out.

@biojppm
Contributor Author

biojppm commented Nov 25, 2020

> In any case, your PR was good no matter what.

I knew it was. But for a moment I was worried I would not be able to prove it convincingly. Bold statements like the one in the title require solid proof.

@biojppm
Contributor Author

biojppm commented Nov 25, 2020

@lemire let me know if there's still something not clear.

I'm curious to investigate bloaty's analysis of the several programs above, and how tweaks to the code (e.g. fputs_unlocked() vs fputs()) cause the results to vary. I noticed a lot of mutex symbols above, which is sad because again it means that by default we're paying for something even if we're not using it. But unfortunately I'm really tied up and cannot spend any time on this ATM.
