
Thread: 100 mb of wikipedia compressed to 1kb, lossless compression

  1. #1

    100 mb of wikipedia compressed to 1kb, lossless compression

    99.9999% compression ratio using thinBasic.
    Lossless compression.

    100 MB of Wikipedia compressed to 1 kb.

    Next, to test this week: the compression ratio for 1 GB.

    Theoretically, this new codec can compress GB, TB, or PB into a few KB.

    Last edited by alberto; 16-07-2019 at 22:38. Reason: typo

  2. #2
    ErosOlmi (thinBasic author)
    Join Date: Sep 2004 | Location: Milan - Italy | Age: 57 | Posts: 8,777 | Rep Power: 10
    Are you kidding us? Please provide some proof.
    www.thinbasic.com | www.thinbasic.com/community/ | help.thinbasic.com
    Windows 10 Pro for Workstations 64bit - 32 GB - Intel(R) Xeon(R) W-10855M CPU @ 2.80GHz - NVIDIA Quadro RTX 3000

  3. #3
    Hi Eros,

    The background that made it possible is here: https://largestprimes.xyz

    I am participating in the Prize for Compressing Human Knowledge with this codec.
    They will test it for a month and evaluate whether we broke their record.

    It seems that their current compression record is about 15 MB for the 100 MB Wikipedia file, using codecs such as phda9, decomp8 and paq8.

    Since I was already working with very large numbers for primes, it was easy to compress 100 million digits using thinBasic.

  4. #4
    ErosOlmi (thinBasic author)

  5. #5
    ErosOlmi (thinBasic author)
    This is my small contribution to the challenge.

    The attached script performs the following steps:
    1. download the zipped file used for the challenge, if it is not already present in the script directory
    2. extract the included file into a 100 MB string buffer
    3. compress it into a new string
    4. report the results, which are very poor compared to the current challenge results


    Ciao
    Eros




      ' Modules: ZLib (zip extraction), File (FILE_Exists), console (printl, WaitKey), inet (download)
      uses "ZLib"
      uses "File"
      uses "console"
      uses "inet"

      printl "---------------------------------------------------------------"
      printl "Challenge: https://en.wikipedia.org/wiki/Hutter_Prize"
      printl "           http://prize.hutter1.net/"
      printl "---------------------------------------------------------------"
      printl "download zipped file used for the challenge if not already present in current script directory"
      printl "extract included file into a string buffer of 100MB"
      printl "compress it into a new string"
      printl "report results .... very poor compared to current challenge results"
      printl "---------------------------------------------------------------"
      printl
      printl "Press any key to Start---" IN %CCOLOR_FYELLOW
      WaitKey

      string sUrlZipFile = "http://mattmahoney.net/dc/enwik8.zip"
      string sLocalZipFileName = APP_SourcePath & "enwik8.zip"

      ' Step 1: download the zip only if it is not already in the script directory
      printl "---Start downloading", sUrlZipFile
      if FILE_Exists(sLocalZipFileName) then
        printl "---File already downloaded"
      else
        printl "   Downloading ..."
        INET_UrlDownload(sUrlZipFile, sLocalZipFileName)
      end if
      printl "   Local file name", sLocalZipFileName
      printl

      ' Step 2: extract the enwik8 entry from the zip into a string buffer
      string sUncompressedFileName = "enwik8"
      printl "---Extracting " & sUncompressedFileName & " to string"
      printl "   start", Time$
      string sOriginal = ZLib_ExtractToString(sLocalZipFileName, sUncompressedFileName)
      printl "   end", Time$

      printl "   Extraction done. Size of string uncompressed:", LenF(sOriginal)
      printl

      ' Step 3: compress the buffer into a new string
      string sCompress
      printl "---Start compressing", Time$
      sCompress = StrZip$(sOriginal)
      printl "   End compressing", Time$

      ' Step 4: report original and compressed sizes
      printl "   Len Original string.....", LenF(sOriginal)
      printl "   Len compressed string...", LenF(sCompress)

      printl
      printl "Press any key to end---" IN %CCOLOR_FYELLOW
      WaitKey
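For readers without thinBasic at hand, steps 3 and 4 of the script can be sketched in Python. This is only an illustration under stated assumptions: Python's standard zlib module stands in for StrZip$ (whether the two use the same algorithm or settings is not confirmed by this thread), and the function name compression_report is made up for the example.

```python
import zlib


def compression_report(data: bytes, level: int = 9) -> dict:
    """Compress data with zlib and report sizes (hypothetical helper).

    zlib here is a stand-in for thinBasic's StrZip$; that equivalence
    is an assumption, not something stated in the thread.
    """
    if not data:
        raise ValueError("data must not be empty")
    compressed = zlib.compress(data, level)
    # Lossless check: decompressing must give back the exact input.
    if zlib.decompress(compressed) != data:
        raise RuntimeError("round trip failed")
    return {
        "original": len(data),
        "compressed": len(compressed),
        "ratio": len(compressed) / len(data),
    }


if __name__ == "__main__":
    # Small repetitive sample; the real test file is the 100 MB enwik8.
    sample = b"<page><title>Example</title></page>\n" * 1000
    report = compression_report(sample)
    print("original bytes  :", report["original"])
    print("compressed bytes:", report["compressed"])
    print("compressed/original:", round(report["ratio"], 4))
```

The round-trip check is the part that matters for any lossless claim: a result only counts if decompression reproduces the input byte for byte.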
    

  6. #6
    Hi Eros,

    Yes, exactly, that is the prize.


  7. #7
    Thank you, Eros, for your experience and contribution.

    thinBasic is very powerful,
    with lots of commands to learn...


  8. #8

    Question

    Originally posted by alberto:
    99.9999% compression ratio using thinBasic.
    Lossless compression.

    100 MB of Wikipedia compressed to 1 kb.

    Next, to test this week: the compression ratio for 1 GB.

    Theoretically, this new codec can compress GB, TB, or PB into a few KB.

    Hi Alberto,
    how does the 1 kb (kilobits?) compare to the Shannon entropy of the 100 MB file?
    ThinBasic 1.11.6.0 ALPHA - Windows 8.1 x64
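One way to put a number on the question above is to measure the order-0 (byte frequency) Shannon entropy of the file. The Python sketch below is only an illustration: the function names are made up for the example, and order-0 entropy treats every byte as independent, so it bounds only coders that ignore context. Context-modelling compressors such as the paq family can go below it.

```python
import math
from collections import Counter


def order0_entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    # H = sum over symbols of p * log2(1/p), with p = count / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def order0_size_estimate_bytes(data: bytes) -> float:
    """Size an ideal order-0 coder would need for data, in bytes."""
    return order0_entropy_bits_per_byte(data) * len(data) / 8


if __name__ == "__main__":
    sample = b"abab" * 256  # two equally likely symbols
    print(order0_entropy_bits_per_byte(sample))  # 1.0 bit per byte
    print(order0_size_estimate_bytes(sample))    # 128.0 bytes
```

Reading enwik8 with open("enwik8", "rb").read() and passing the result to order0_size_estimate_bytes gives a figure that can be compared directly with a claimed compressed size.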

  9. #9
    Hi,

    Good question.

    They say that their 100 MB data file, enwik8, is fairly uniform.

    Their link "Information about the enwik8 data file" is:

    http://mattmahoney.net/dc/textdata.html

    There you will find detailed information about the data, including statistics and graphs of its distribution:


    This competition ranks lossless data compression programs by the compressed size (including the size of the decompression program) of the first 10^9 bytes of the XML text dump of the English version of Wikipedia on Mar. 3, 2006.

    enwik8: compressed size of the first 10^8 bytes of enwik9. This data is used for the Hutter Prize, and is also ranked here but has no effect on this ranking.
    enwik9: compressed size of the first 10^9 bytes of enwiki-20060303-pages-articles.xml
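Since enwik8 is described as a byte prefix of enwik9, the relationship can be sketched as a small Python helper. The function name write_prefix and the file paths in the usage note are illustrative assumptions, not part of the benchmark's own tooling.

```python
def write_prefix(src_path: str, dst_path: str, n_bytes: int,
                 chunk_size: int = 1 << 20) -> int:
    """Copy the first n_bytes of src_path into dst_path.

    Returns the number of bytes actually written, which is smaller
    than n_bytes when the source file is shorter.
    """
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while written < n_bytes:
            # Read in chunks so a 1 GB source never sits in memory at once.
            chunk = src.read(min(chunk_size, n_bytes - written))
            if not chunk:
                break  # source exhausted early
            dst.write(chunk)
            written += len(chunk)
    return written
```

With a local copy of enwik9, write_prefix("enwik9", "enwik8", 10**8) would produce a file matching the description above.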

    They have been benchmarking well-known codecs for years.


  10. #10
    Hi,


    I wonder if you mean the certainty of the outcome of the compressed files generated.

    In that case the entropy is zero.

    "Entropy is zero when one outcome is certain."

    http://basicknowledge101.com/pdf/km/...%20theory).pdf
    2 shannons of entropy: Information entropy is the log-base-2 of the number of possible outcomes; with two coins there are four outcomes, and the entropy is two bits.
    Entropy is zero when one outcome is certain.
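The two statements quoted above can be checked with a few lines of Python. This is a minimal sketch that assumes all outcomes are equally likely, which is the case the log-base-2 rule covers; the function name is made up for the example.

```python
import math


def uniform_entropy_bits(outcomes: int) -> float:
    """Entropy in bits of a choice among equally likely outcomes."""
    if outcomes < 1:
        raise ValueError("need at least one outcome")
    return math.log2(outcomes)


if __name__ == "__main__":
    print(uniform_entropy_bits(4))  # two coins, four outcomes: 2.0
    print(uniform_entropy_bits(1))  # one certain outcome: 0.0
```

In this setting the "outcomes" are the possible input files the codec must be able to tell apart, so the count that matters is how many different 100 MB inputs exist, not the single file that was tested.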


    It is the first time I have read about Shannon entropy; it's good to learn something every day.

    thanks

