Hadoop Benchmark
Here are my steps. First, download the Hadoop 2.10.0 binary, untar it, and cd into the hadoop directory.
wget http://apache.mirrors.hoobly.com/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz
tar xvf hadoop-2.10.0.tar.gz
cd hadoop-2.10.0
Modify the file ./etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.seaweedfs.impl</name>
    <value>seaweed.hdfs.SeaweedFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>seaweedfs://localhost:8888</value>
  </property>
</configuration>
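If the same two properties need to be produced for several filer addresses, the file can be generated rather than edited by hand. The following is a minimal sketch using only the Python standard library; the function name `build_core_site` is hypothetical, and the property names and values are the ones from the configuration above.

```python
import xml.etree.ElementTree as ET


def build_core_site(filer_address: str) -> str:
    """Return core-site.xml text containing the two SeaweedFS properties."""
    properties = {
        # Class that handles the seaweedfs:// scheme (value taken from the config above).
        "fs.seaweedfs.impl": "seaweed.hdfs.SeaweedFileSystem",
        # Default file system URI, pointing at the filer.
        "fs.defaultFS": f"seaweedfs://{filer_address}",
    }
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # Pretty-print when available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


print(build_core_site("localhost:8888"))
```

The printed text can then be saved as ./etc/hadoop/core-site.xml.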
Then get the SeaweedFS Hadoop client jar.
cd share/hadoop/common/lib/
wget https://repo1.maven.org/maven2/com/seaweedfs/seaweedfs-hadoop2-client/3.80/seaweedfs-hadoop2-client-3.80.jar
The TestDFSIO benchmark is used for measuring I/O (read/write) performance.
However, the data it generates for writing is all zeros, and SeaweedFS automatically compresses this kind of data, so the numbers below are not yet a rigorous measurement.
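To see why all-zero data distorts a storage benchmark, compare how well zeros and random bytes compress. This is only an illustration: `zlib` stands in for whatever compression SeaweedFS applies, and the 1 MiB sample size is an arbitrary choice.

```python
import os
import zlib

CHUNK = 1024 * 1024  # 1 MiB sample, arbitrary size for illustration

zeros = bytes(CHUNK)             # all-zero payload, like the benchmark data described above
random_data = os.urandom(CHUNK)  # effectively incompressible payload for comparison


def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more compressible)."""
    return len(zlib.compress(data)) / len(data)


print(f"all zeros: {compressed_ratio(zeros):.4f}")
print(f"random   : {compressed_ratio(random_data):.4f}")
```

The zero payload shrinks to a tiny fraction of its size while the random payload does not shrink at all, so a store that compresses data moves far fewer bytes to disk in the first case.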
Start the TestDFSIO write tests:
bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.10.0-tests.jar TestDFSIO -write -nrFiles 8 -size 32GB -bufferSize 8388608 -resFile /tmp/TestDFSIOwrite.txt
...
20/07/25 16:48:21 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
20/07/25 16:48:21 INFO fs.TestDFSIO: Date & time: Sat Jul 25 16:48:21 PDT 2020
20/07/25 16:48:21 INFO fs.TestDFSIO: Number of files: 8
20/07/25 16:48:21 INFO fs.TestDFSIO: Total MBytes processed: 262144
20/07/25 16:48:21 INFO fs.TestDFSIO: Throughput mb/sec: 399.16
20/07/25 16:48:21 INFO fs.TestDFSIO: Average IO rate mb/sec: 399.34
20/07/25 16:48:21 INFO fs.TestDFSIO: IO rate std deviation: 8.56
20/07/25 16:48:21 INFO fs.TestDFSIO: Test exec time sec: 659.45
20/07/25 16:48:21 INFO fs.TestDFSIO:
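The summary lines are easy to post-process. The sketch below parses the numeric fields from the write run above and cross-checks them against the command line (8 files of 32 GB each). The regular expression and the function name `parse_summary` are assumptions made for this example, not part of TestDFSIO.

```python
import re

# Numeric summary lines copied from the write run above.
LOG = """\
20/07/25 16:48:21 INFO fs.TestDFSIO: Number of files: 8
20/07/25 16:48:21 INFO fs.TestDFSIO: Total MBytes processed: 262144
20/07/25 16:48:21 INFO fs.TestDFSIO: Throughput mb/sec: 399.16
20/07/25 16:48:21 INFO fs.TestDFSIO: Average IO rate mb/sec: 399.34
20/07/25 16:48:21 INFO fs.TestDFSIO: IO rate std deviation: 8.56
20/07/25 16:48:21 INFO fs.TestDFSIO: Test exec time sec: 659.45
"""

# Matches "<key>: <number>" after the logger prefix; non-numeric lines are skipped.
LINE = re.compile(r"INFO fs\.TestDFSIO:\s+(?P<key>[^:]+):\s+(?P<value>[\d.]+)\s*$")


def parse_summary(log: str) -> dict:
    """Map each numeric summary field name to a float."""
    fields = {}
    for line in log.splitlines():
        match = LINE.search(line)
        if match:
            fields[match.group("key").strip()] = float(match.group("value"))
    return fields


summary = parse_summary(LOG)

# -nrFiles 8 -size 32GB should give 8 * 32 * 1024 MB in total.
expected_mb = 8 * 32 * 1024
print("size matches command line:", summary["Total MBytes processed"] == expected_mb)

# Total data divided by wall-clock time of the whole test.
overall = summary["Total MBytes processed"] / summary["Test exec time sec"]
print(f"overall MB/s: {overall:.2f}")
```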
Start the TestDFSIO read tests:
bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.10.0-tests.jar TestDFSIO -read -nrFiles 8 -size 32GB -bufferSize 8388608 -resFile /tmp/TestDFSIOwrite.txt
...
20/07/17 15:59:38 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
20/07/17 15:59:38 INFO fs.TestDFSIO: Date & time: Fri Jul 17 15:59:38 PDT 2020
20/07/17 15:59:38 INFO fs.TestDFSIO: Number of files: 8
20/07/17 15:59:38 INFO fs.TestDFSIO: Total MBytes processed: 8192
20/07/17 15:59:38 INFO fs.TestDFSIO: Throughput mb/sec: 393.26
20/07/17 15:59:38 INFO fs.TestDFSIO: Average IO rate mb/sec: 393.72
20/07/17 15:59:38 INFO fs.TestDFSIO: IO rate std deviation: 13.33
20/07/17 15:59:38 INFO fs.TestDFSIO: Test exec time sec: 22.76
20/07/17 15:59:38 INFO fs.TestDFSIO:
To compare the performance of HDFS and SeaweedFS, we ran four common Spark operations: `count`, `group by`, `join`, and `write`. For `group by` and `join`, a `count` follows as the action that triggers execution.
The basic cluster configuration is as follows:
- HDFS:
  - Number of nodes: 25
  - Total disks: 36 disks * 25 nodes = 900 disks
  - Disk capacity: 3.7T SATA
  - Total disk capacity: 3.19PB
  - Replication: 5
- SeaweedFS:
  - Number of nodes: 6 (3+3 rack)
  - Disk capacity: 3.7T SATA
  - Cluster max volume: 21500
  - Total disk capacity: 799TB
  - Replication policy: 010
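The two clusters keep different numbers of copies, which affects how much unique data each can hold. The sketch below decodes a three-digit SeaweedFS replication string (extra copies on other data centers, other racks, and other servers in the same rack) and divides the listed capacities by the copy count. The helper name `seaweedfs_copies` and the conversion 1 PB = 1000 TB are assumptions for this example.

```python
def seaweedfs_copies(replication: str) -> int:
    """Total copies implied by a three-digit SeaweedFS replication string.

    The digits count extra replicas on other data centers, other racks,
    and other servers in the same rack, in that order.
    """
    if len(replication) != 3 or not replication.isdigit():
        raise ValueError(f"expected three digits, got {replication!r}")
    return 1 + sum(int(digit) for digit in replication)


# Figures from the list above.
hdfs_total_disks = 36 * 25     # disks per node * nodes
hdfs_raw_tb = 3.19 * 1000      # 3.19 PB, assuming 1 PB = 1000 TB
hdfs_copies = 5                # HDFS replication factor
seaweed_raw_tb = 799
seaweed_copies = seaweedfs_copies("010")  # one extra copy on another rack

print(f"HDFS disks: {hdfs_total_disks}")
print(f"HDFS capacity per unique byte: {hdfs_raw_tb / hdfs_copies:.1f} TB")
print(f"SeaweedFS capacity per unique byte: {seaweed_raw_tb / seaweed_copies:.1f} TB")
```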
Here are the details and results of our test. At the beginning of the test, we loaded the same data into both HDFS and SeaweedFS: 100 million records stored in 200 Parquet files of about 89 MB each. We ran Spark on YARN with 20 executors. In Spark, we created two DataFrames by reading the Parquet files from HDFS and from HCFS (the Hadoop Compatible File System interface to SeaweedFS) separately, then executed `count`, `group by`, and `join` 100 times each, and `write` 10 times, on each DataFrame.
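The repeated-run procedure can be sketched as a small timing harness. This is not the code used for the test: `time_runs` and `fake_count` are hypothetical names, and the stand-in workload only exists so the sketch runs without a cluster. In a Spark job the callable would wrap an action such as `df.count()`.

```python
import time
from typing import Callable, List


def time_runs(action: Callable[[], object], runs: int) -> List[float]:
    """Run `action` `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def fake_count() -> int:
    """Stand-in workload so the sketch runs without Spark."""
    return sum(1 for _ in range(100_000))


count_times = time_runs(fake_count, runs=100)
print(len(count_times), "timings collected")
```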
For `count`, SeaweedFS has a clear advantage: the DataFrame read from HDFS averaged 4.05 seconds, while the one read from SeaweedFS averaged only 0.659 seconds. The results follow:
Summary | HDFS | SeaweedFS |
---|---|---|
Count | 100 | 100 |
Mean | 4.050 | 0.659 |
Stddev | 0.264 | 0.941 |
Min | 3.678 | 0.392 |
Max | 5.692 | 9.688 |
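The rows of these summary tables (Count, Mean, Stddev, Min, Max) can be computed from a list of per-run durations as sketched below. The sample durations are made up, and using the sample standard deviation (dividing by n - 1) is an assumption about how the original figures were produced.

```python
import statistics
from typing import Dict, Sequence


def summarize(durations: Sequence[float]) -> Dict[str, float]:
    """Compute the same five rows that the result tables report."""
    return {
        "Count": len(durations),
        "Mean": statistics.mean(durations),
        "Stddev": statistics.stdev(durations),  # sample standard deviation (n - 1)
        "Min": min(durations),
        "Max": max(durations),
    }


sample = [0.40, 0.45, 0.50, 0.65]  # made-up durations in seconds
for row, value in summarize(sample).items():
    print(f"{row} | {value:.3f}")
```

Because Stddev is sensitive to outliers, a single slow run (such as the 9.688 maximum in the SeaweedFS column) can raise it well above what the typical run-to-run variation would give.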
For `write`, we wrote the DataFrame read from HDFS to SeaweedFS, and the DataFrame read from HCFS to SeaweedFS. The results follow:
Summary | HDFS | SeaweedFS |
---|---|---|
Count | 10 | 10 |
Mean | 234.279 | 232.078 |
Stddev | 26.823 | 12.652 |
Min | 216.931 | 214.349 |
Max | 307.330 | 252.375 |
For `group by`, the results follow:
Summary | HDFS | SeaweedFS |
---|---|---|
Count | 100 | 100 |
Mean | 14.121 | 12.515 |
Stddev | 1.972 | 1.255 |
Min | 12.879 | 11.322 |
Max | 32.296 | 22.573 |
For `join`, each DataFrame was joined with itself on one column. The results follow:
Summary | HDFS | SeaweedFS |
---|---|---|
Count | 100 | 100 |
Mean | 25.684 | 23.897 |
Stddev | 0.934 | 1.381 |
Min | 24.006 | 22.275 |
Max | 30.991 | 30.279 |
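To compare the four operations at a glance, the ratio of the mean times can be computed from the tables above. The sketch below uses only the reported Mean rows; the helper name `speedup` is hypothetical.

```python
# Mean times copied from the four tables above, as (HDFS, SeaweedFS).
MEANS = {
    "count": (4.050, 0.659),
    "write": (234.279, 232.078),
    "group by": (14.121, 12.515),
    "join": (25.684, 23.897),
}


def speedup(hdfs_mean: float, seaweedfs_mean: float) -> float:
    """HDFS mean divided by SeaweedFS mean; values above 1 favor SeaweedFS."""
    return hdfs_mean / seaweedfs_mean


for operation, (hdfs_mean, seaweedfs_mean) in MEANS.items():
    print(f"{operation:>8}: {speedup(hdfs_mean, seaweedfs_mean):.2f}x")
```

The ratio is large only for `count`; for `write`, `group by`, and `join` the two means are within about 13% of each other.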