Stop Thinking, Just Do!

Sungsoo Kim's Blog

Fun with TPC-H (Part 1)


14 July 2014


Article Source: Fun with TPC-H (Part 1)

This’ll be the first in a series of articles I’m planning to write about the TPC-H benchmark and SQL Server.

Recently I attended an HP/Microsoft event where they discussed the two companies’ new offerings around the Fast Track Data Warehouse.

Having spent some time reading this document and looking at the benchmarks in there (BCR – Benchmark Consumption Rate, MCR – Maximum CPU Core Consumption Rate, UDC – Required User Data Capacity and, more importantly, QphH – the TPC-H composite queries-per-hour metric), I thought I’d look into them a little more.

I’m going to cover my experiences in getting this up and running, which was interesting since I couldn’t find a reasonable example anywhere that ‘anybody’ could follow to build a database from this dataset. I also wanted to run the benchmarks and see how my laptop compared with some of the meatier servers in the TPC-H Top Tens.

To carry out these tests, I’m running on my laptop (a Sony Vaio with an i5-430M (2 cores, 4 threads at 2.27GHz) and 8GB RAM). I’m aware that this is significantly different to the specification of some of these servers; however, there is also a significant difference in price, and it’s an interesting (to some) test…

Overview

The benchmarks are based on a standard set of benchmark data, the TPC-H data, which is available to download from the TPC website.

“The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.” – TPC Website

Getting Started – DBGen

To get the dataset, you first need to download the DBGEN reference data set, which is available from the link above. The dataset is built using an application whose C++ source is included in the download. I opened the project files using Visual Studio 2010, built them, and got a resulting dbgen.exe file. This was much more straightforward than I was expecting.

There are a number of parameters for this application, but if you run it with the default settings you get the 1GB dataset. If you set the -s parameter (the scale factor) to 10 (so -s10), you get a 10GB dataset. In the TPC-H results list, the published results are all from scale factor 100 upwards (so -s100, giving a 100GB dataset).
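As a sketch, assuming you’ve built dbgen.exe and are running it from its own directory (the full set of flags is described in the README that ships with the generator):

    REM Generate the default 1GB dataset (.tbl files) into the current directory
    dbgen.exe

    REM Generate the 10GB dataset; -s sets the scale factor
    dbgen.exe -s 10

    REM -f overwrites any .tbl files left over from a previous run
    dbgen.exe -s 10 -f

The generator writes one pipe-delimited .tbl file per table (customer.tbl, lineitem.tbl and so on), which is what we’ll bulk load below.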

Getting Started – SQL Server

Next up, I need to create a database. For the initial test, I created the database on an external drive (connected through eSATA). I used the following script.

    CREATE DATABASE [TPCH] ON  PRIMARY
    ( NAME = N'tpch', FILENAME = N'X:\TPC-H\tpch1gb.mdf' , SIZE = 1024MB , MAXSIZE = UNLIMITED, FILEGROWTH = 128MB ),
    ( NAME = N'tpch_2', FILENAME = N'X:\TPC-H\tpch1gb_2.ndf' , SIZE = 1024MB , MAXSIZE = UNLIMITED, FILEGROWTH = 128MB )
    LOG ON ( NAME = N'tpch_log', FILENAME = N'X:\TPC-H\tpch1gb_log.ldf' , SIZE = 512MB , MAXSIZE = 2048GB , FILEGROWTH = 100MB )
    GO

Database tables were created using the following script, which is all built using the schema available in the TPC-H documentation.

    CREATE TABLE [dbo].[customer](
        [C_CustKey] [int] NULL,
        [C_Name] [varchar](64) NULL,
        [C_Address] [varchar](64) NULL,
        [C_NationKey] [int] NULL,
        [C_Phone] [varchar](64) NULL,
        [C_AcctBal] [decimal](13, 2) NULL,
        [C_MktSegment] [varchar](64) NULL,
        [C_Comment] [varchar](120) NULL,
        [skip] [varchar](64) NULL
    ) ON [PRIMARY]
    GO
    CREATE TABLE [dbo].[lineitem](
        [L_OrderKey] [int] NULL,
        [L_PartKey] [int] NULL,
        [L_SuppKey] [int] NULL,
        [L_LineNumber] [int] NULL,
        [L_Quantity] [int] NULL,
        [L_ExtendedPrice] [decimal](13, 2) NULL,
        [L_Discount] [decimal](13, 2) NULL,
        [L_Tax] [decimal](13, 2) NULL,
        [L_ReturnFlag] [varchar](64) NULL,
        [L_LineStatus] [varchar](64) NULL,
        [L_ShipDate] [datetime] NULL,
        [L_CommitDate] [datetime] NULL,
        [L_ReceiptDate] [datetime] NULL,
        [L_ShipInstruct] [varchar](64) NULL,
        [L_ShipMode] [varchar](64) NULL,
        [L_Comment] [varchar](64) NULL,
        [skip] [varchar](64) NULL
    ) ON [PRIMARY]
    GO
    CREATE TABLE [dbo].[nation](
        [N_NationKey] [int] NULL,
        [N_Name] [varchar](64) NULL,
        [N_RegionKey] [int] NULL,
        [N_Comment] [varchar](160) NULL,
        [skip] [varchar](64) NULL
    ) ON [PRIMARY]
    GO
    CREATE TABLE [dbo].[orders](
        [O_OrderKey] [int] NULL,
        [O_CustKey] [int] NULL,
        [O_OrderStatus] [varchar](64) NULL,
        [O_TotalPrice] [decimal](13, 2) NULL,
        [O_OrderDate] [datetime] NULL,
        [O_OrderPriority] [varchar](15) NULL,
        [O_Clerk] [varchar](64) NULL,
        [O_ShipPriority] [int] NULL,
        [O_Comment] [varchar](80) NULL,
        [skip] [varchar](64) NULL
    ) ON [PRIMARY]
    GO
    CREATE TABLE [dbo].[part](
        [P_PartKey] [int] NULL,
        [P_Name] [varchar](64) NULL,
        [P_Mfgr] [varchar](64) NULL,
        [P_Brand] [varchar](64) NULL,
        [P_Type] [varchar](64) NULL,
        [P_Size] [int] NULL,
        [P_Container] [varchar](64) NULL,
        [P_RetailPrice] [decimal](13, 2) NULL,
        [P_Comment] [varchar](64) NULL,
        [skip] [varchar](64) NULL
    ) ON [PRIMARY]
    GO
    CREATE TABLE [dbo].[partsupp](
        [PS_PartKey] [int] NULL,
        [PS_SuppKey] [int] NULL,
        [PS_AvailQty] [int] NULL,
        [PS_SupplyCost] [decimal](13, 2) NULL,
        [PS_Comment] [varchar](200) NULL,
        [skip] [varchar](64) NULL
    ) ON [PRIMARY]
    GO
    CREATE TABLE [dbo].[region](
        [R_RegionKey] [int] NULL,
        [R_Name] [varchar](64) NULL,
        [R_Comment] [varchar](160) NULL,
        [skip] [varchar](64) NULL
    ) ON [PRIMARY]
    GO
    CREATE TABLE [dbo].[supplier](
        [S_SuppKey] [int] NULL,
        [S_Name] [varchar](64) NULL,
        [S_Address] [varchar](64) NULL,
        [S_NationKey] [int] NULL,
        [S_Phone] [varchar](18) NULL,
        [S_AcctBal] [decimal](13, 2) NULL,
        [S_Comment] [varchar](105) NULL,
        [skip] [varchar](64) NULL
    ) ON [PRIMARY]
    GO

Finally, we get to load the data. I had the source files in the ‘C:\TPC-H\source\’ directory, loading into the database on the external drive (to reduce I/O contention).

The data was loaded using the following:

    BULK INSERT part FROM 'C:\TPC-H\source\part.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
    BULK INSERT customer FROM 'C:\TPC-H\source\customer.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
    BULK INSERT orders FROM 'C:\TPC-H\source\orders.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
    BULK INSERT partsupp FROM 'C:\TPC-H\source\partsupp.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
    BULK INSERT supplier FROM 'C:\TPC-H\source\supplier.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
    BULK INSERT lineitem FROM 'C:\TPC-H\source\lineitem.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
    BULK INSERT nation FROM 'C:\TPC-H\source\nation.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
    BULK INSERT region FROM 'C:\TPC-H\source\region.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')

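Since the tables have no constraints, BULK INSERT will happily load misaligned rows without complaint, so it’s worth a quick sanity check of the row counts once the loads finish. A minimal sketch:

    -- Sanity check: compare loaded row counts against the expected
    -- counts for the scale factor (see the table of row counts below)
    SELECT 'customer' AS table_name, COUNT_BIG(*) AS row_count FROM dbo.customer
    UNION ALL SELECT 'lineitem', COUNT_BIG(*) FROM dbo.lineitem
    UNION ALL SELECT 'nation',   COUNT_BIG(*) FROM dbo.nation
    UNION ALL SELECT 'orders',   COUNT_BIG(*) FROM dbo.orders
    UNION ALL SELECT 'part',     COUNT_BIG(*) FROM dbo.part
    UNION ALL SELECT 'partsupp', COUNT_BIG(*) FROM dbo.partsupp
    UNION ALL SELECT 'region',   COUNT_BIG(*) FROM dbo.region
    UNION ALL SELECT 'supplier', COUNT_BIG(*) FROM dbo.supplier
    ORDER BY table_name;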
The row counts and load times were as follows:

| Table | 1GB Dataset | 10GB Dataset |
|---|---|---|
| Customers | 150,000 | 1,500,000 |
| Line Items | 6,001,215 | 59,986,052 |
| Nation | 25 | 25 |
| Orders | 1,500,000 | 15,000,000 |
| Part | 200,000 | 2,000,000 |
| PartSupp (Part Supplier) | 800,000 | 8,000,000 |
| Supplier | 10,000 | 100,000 |
| Region | 5 | 5 |
| Resulting source file size | 1.03GB | 10.5GB |
| Time to load | 50 seconds | 9 minutes 56 seconds |
| Resulting SQL DB size | 1.13GB | 11.38GB |

This gives us a basis from which to run the benchmarks, and do some optimisations.
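For a flavour of what those benchmark queries look like, here is TPC-H Query 1 (the pricing summary report) as it would run against this schema, with the default substitution value of 90 days; the date arithmetic is written with T-SQL’s DATEADD, but otherwise this is the standard query from the TPC-H specification:

    -- TPC-H Query 1: pricing summary report over lineitem
    SELECT  l_returnflag,
            l_linestatus,
            SUM(l_quantity)                                        AS sum_qty,
            SUM(l_extendedprice)                                   AS sum_base_price,
            SUM(l_extendedprice * (1 - l_discount))                AS sum_disc_price,
            SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax))  AS sum_charge,
            AVG(l_quantity)                                        AS avg_qty,
            AVG(l_extendedprice)                                   AS avg_price,
            AVG(l_discount)                                        AS avg_disc,
            COUNT(*)                                               AS count_order
    FROM    dbo.lineitem
    WHERE   l_shipdate <= DATEADD(day, -90, '1998-12-01')
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus;

On this schema it scans the whole lineitem table, which is exactly why it makes a useful yardstick for the optimisations to come.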

Next time, I’ll run through some of those optimisations and get some benchmarks out of this system.

