Article Source
- Title: Fun with TPC-H (Part 1)
Fun with TPC-H (Part 1)
This’ll be the first in a series of articles I’m planning on writing about the TPC-H Benchmark, and SQL Server.
Recently I attended an HP/Microsoft event, where they discussed the new offerings from the two companies, around the FastTrack Data Warehouse.
Having spent some time reading this document, and looking at the benchmarks in there (BCR – Benchmark Consumption Rate, MCR – Maximum CPU Core Consumption Rate, UDC – Required User Data Capacity and more importantly the QphH), I thought I’d look into them a little more.
I’m going to cover the experiences I had in getting this up and running, which was interesting since I couldn’t find anywhere where there was a reasonable example that ‘anybody’ could use to build a database using this dataset. Also, I wanted to run the benchmarks and see where my laptop was, compared with some of the meatier servers on the TPC-H Top Tens.
To carry these tests out, I’m running on my laptop (Sony Vaio, with i5-430M (2 cores, 4 threads at 2.27Ghz) and 8Gb RAM). I’m aware that this is significantly different to the specification of some of these servers, however, there is also a significant difference in price, and it’s an interesting (to some) test…
Overview
The benchmarks are based on a standard set of benchmark data, the TPC-H data, which is available to download here.
“The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.” – TPC Website
Getting Started – DBGen
To get the dataset, you first need to download the DBGEN Reference data set, which is available from the link above. The dataset is built using an application, which the download contains, in C++ form. I opened the project files using Visual Studio 2010, built them, and got a resulting dbgen.exe file. This was much more straightforward than I was expecting.
There are a number of parameters for this application, but if you run it with the default settings, you get the 1gb dataset. If you use the –s parameter (so the scale factor) set to 10 (so –s10), you get a 10Gb dataset. From the TPC-H Results list, the results are all from 100 upwards (so –s100, to get a 100gb dataset).
Getting Started – SQL Server
Next up, I need to create a database. For the initial test, I created the database on an external drive (connected through eSATA). I used the following script.
CREATE DATABASE [TPCH] ON PRIMARY
( NAME = N'tpch', FILENAME = N'X:\TPC-H\tpch1gb.mdf' , SIZE = 1024MB , MAXSIZE = UNLIMITED, FILEGROWTH = 128MB ),
( NAME = N'tpch_2', FILENAME = N'X:\TPC-H\tpch1gb_2.ndf' , SIZE = 1024MB , MAXSIZE = UNLIMITED, FILEGROWTH = 128MB )
LOG ON ( NAME = N'tpch_log', FILENAME = N'X:\TPC-H\tpch1gb_log.ldf' , SIZE = 512MB , MAXSIZE = 2048GB , FILEGROWTH = 100MB )
GO
Database tables were created using the following script, which is all built using the schema available in the TPC-H documentation.
CREATE TABLE [dbo].[customer](
[C_CustKey] [int] NULL,
[C_Name] [varchar](64) NULL,
[C_Address] [varchar](64) NULL,
[C_NationKey] [int] NULL,
[C_Phone] [varchar](64) NULL,
[C_AcctBal] [decimal](13, 2) NULL,
[C_MktSegment] [varchar](64) NULL,
[C_Comment] [varchar](120) NULL,
[skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[lineitem](
[L_OrderKey] [int] NULL,
[L_PartKey] [int] NULL,
[L_SuppKey] [int] NULL,
[L_LineNumber] [int] NULL,
[L_Quantity] [int] NULL,
[L_ExtendedPrice] [decimal](13, 2) NULL,
[L_Discount] [decimal](13, 2) NULL,
[L_Tax] [decimal](13, 2) NULL,
[L_ReturnFlag] [varchar](64) NULL,
[L_LineStatus] [varchar](64) NULL,
[L_ShipDate] [datetime] NULL,
[L_CommitDate] [datetime] NULL,
[L_ReceiptDate] [datetime] NULL,
[L_ShipInstruct] [varchar](64) NULL,
[L_ShipMode] [varchar](64) NULL,
[L_Comment] [varchar](64) NULL,
[skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[nation](
[N_NationKey] [int] NULL,
[N_Name] [varchar](64) NULL,
[N_RegionKey] [int] NULL,
[N_Comment] [varchar](160) NULL,
[skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[orders](
[O_OrderKey] [int] NULL,
[O_CustKey] [int] NULL,
[O_OrderStatus] [varchar](64) NULL,
[O_TotalPrice] [decimal](13, 2) NULL,
[O_OrderDate] [datetime] NULL,
[O_OrderPriority] [varchar](15) NULL,
[O_Clerk] [varchar](64) NULL,
[O_ShipPriority] [int] NULL,
[O_Comment] [varchar](80) NULL,
[skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[part](
[P_PartKey] [int] NULL,
[P_Name] [varchar](64) NULL,
[P_Mfgr] [varchar](64) NULL,
[P_Brand] [varchar](64) NULL,
[P_Type] [varchar](64) NULL,
[P_Size] [int] NULL,
[P_Container] [varchar](64) NULL,
[P_RetailPrice] [decimal](13, 2) NULL,
[P_Comment] [varchar](64) NULL,
[skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[partsupp](
[PS_PartKey] [int] NULL,
[PS_SuppKey] [int] NULL,
[PS_AvailQty] [int] NULL,
[PS_SupplyCost] [decimal](13, 2) NULL,
[PS_Comment] [varchar](200) NULL,
[skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[region](
[R_RegionKey] [int] NULL,
[R_Name] [varchar](64) NULL,
[R_Comment] [varchar](160) NULL,
[skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[supplier](
[S_SuppKey] [int] NULL,
[S_Name] [varchar](64) NULL,
[S_Address] [varchar](64) NULL,
[S_NationKey] [int] NULL,
[S_Phone] [varchar](18) NULL,
[S_AcctBal] [decimal](13, 2) NULL,
[S_Comment] [varchar](105) NULL,
[skip] [varchar](64) NULL
) ON [PRIMARY]
GO
Finally, we get to load the data. I had the source files in the ‘C:\TPC-H\source\’ directory, loading into the database on the external drive (to reduce I/O contention).
The data was loaded using the following:
BULK INSERT part FROM 'C:\TPC-H\source\part.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT customer FROM 'C:\TPC-H\source\customer.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT orders FROM 'C:\TPC-H\source\orders.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT partsupp FROM 'C:\TPC-H\source\partsupp.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT supplier FROM 'c:\TPC-H\source\supplier.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT lineitem FROM 'C:\TPC-H\source\lineitem.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT nation FROM 'C:\TPC-H\source\nation.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT region FROM 'C:\TPC-H\source\region.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
This data was loaded in the following times
Table | 1Gb Dataset | 10Gb Dataset |
Customers | 150,000 | 1,500,000 |
Line Items | 6,001,215 | 59,986,052 |
Nation | 25 | 25 |
Orders | 1,500,000 | 15,000,000 |
Part | 200,000 | 2,000,000 |
PartSupp (Part Supplier) | 800,000 | 8,000,000 |
Supplier | 10,000 | 100,000 |
Region | 5 | 5 |
Resulting source file size | 1.03Gb | 10.5Gb |
Time to load | 50 seconds | 9 minutes 56 seconds |
Resulting SQL DB Size | 1.13Gb | 11.38Gb |
This gives us a basis from which to run the benchmarks, and do some optimisations.
Next time, I’ll look at running through some of the optimisations, and get some benchmarks out of this system.