architecture, Data Engineering

Big Data Jargon | FAQ’s and everything you wanted to know and didn’t ask about Big Data …

What is Big Data?

Everyone has their their own definition, I will share with you my definition, and i hope it helps. Big Data is an ecosystem. like saying computer science , electronics etc. The ecosystem was designed to solve  particular set of problems :

  1. Ingestion  –  The data coming into our data lake.
  2. Transformation – converting a data from one type to another – e.g  JSON into CSV.
  3. Modeling – Querying and analyzing the data
  4. Presentation – Queuing the data from your BI and or Application.

So, again what is big data? it is a methodology to solve problems of scalability of tour data platform. you have a big data problem if your infrastructure is unable to process the amount of data you currently have.  This methodology was to designed to help analyze let in a scalable manner and grow as you grow. see example below in the picture, and read more in this lecure: AWS Big Data demystified.

What is Scale up ( = Scale Vertically)?

Scale up – An old Architecture practice. Used mainly in OLTP Databases and monolith applications.  Imagine any Database, and you seem to be having difficulties querying data on it. Then, your IT guys will hint it is probably a resources issue,  and they will throw some more resources on it (CPU, RAM, faster disk). It will probably work, until you hit a point the utilization is low, and you still unable to process data. This my friend is a big data problem, and you may need to consider to switch to a technology where the architecture practice is Scale Out.

What is Scale Out ( = Scale Horizontally) ?

An current architecture practice, Used mainly in distributed applications such as Hadoop. The basic id is – when you hit a stage where the cluster is responding slow, just add another machine to the cluster and entire workload will benefit from this change as the workload is equally distributed on all the machines in the cluster. the more node you have in the cluster, the more performance you get.

Scale Up VS Scale Out ?

Scale Up ( = Scale Vertically) pros and cons

Usually a Small cluster , Usually active/passive like architecture. When you ran out of resources , you increase resources per machine. In the world of Big Data, it is good idea for Power Queries such as Joins. Anti pattern of such a architecture is  Parallelism. Many analysts working on the same cluster. I.e 50 analysts running join query on a scale up architecture, will not work as each of the join queries will consume most the machines resources. 

Scale Out ( = Scale Vertically) pros and cons

Ran into a performance problem? just add more servers and the entire cluster will enjoy performance boost, as the application Distributed : Each node can handle a fraction of the task. Thus, the Pros of this architecture is parallelism.  The Cons of this architecture – is power queries, i.e imagine a really big Join, running on a cluster of medium resources nodes. Since the the task is distributed among the node of the cluster, eventually the there will one machine getting all the intermediate results from thousands nodes running the query, and in the in the of the day – the amount of memory in this last node is limited, and network overhead will slow you down drastically.

What is Structured Data?

Anything you can put in a table, or a comma separated file such CSV. I.e columns and rows.

Challenges associated with CSV like files?

  1. Encoding…
  2. Charset….
  3. Parsing a CSV with a large number of columns is a world of PAIN. AWS GLUE helps with that as it detect the schema automatically.
  4. Once your change a column in a CSV, any ETL using that modified CSV based on the original file schema – will brake.

What is Semi – Structured Data?

Files like XML, JSON, nested JSON , which are text based files, with no rigid schema schema.

Benefits associated with json?

  1. Once your change a column in a JSON, ant ETL using that modified JSON based on the original file schema – will NOT brake.

Challenges associated with JSON like files?

json files may contain highly nested and complex data structures. which will make the parsing world of pain.

What is Unstructured Data?

Pictures, Movies, audio files, any binary file (non readable by any text file editor)

You simply can insert binary files to most RDBMS systems.

Big Data in  Cloud vs Big data in a datacenter? When do I Choose Datacenter for Big Data workloads and when do I choose cloud?

You can do big data in the cloud and in a data center. both can be cheap, fast, and simple.   In the past there was only option build you Hadoop cluster in a datacenter. accumulate the data forever, and this would be your only option.

Nows days there are good cloud ecosystem with different Big Data PaaS solution designed to help you manage Big Data at scale. The common problem is not performance or simplicity. The problem is  how to manage costs in the cloud. In a data center it is simple, you buy server and you decommission it – when it is DEAD after a minimum of 5 years (vendor warranty can be extended to  5 years)…. a typical server can last 7 to 10 years, depending on the average utilization.

Basically there advantages and disadvantages for each decision. It really depends on the use case.

Advantages of using a data center for your big data?

  1. Predictable costs
  2. Could be much more cheaper than cloud if you take into a large time frame such as three years and above.

Disadvantages of using a data center for big data?

  1. Hard to get started, building a DC takes time. Ordering Servers take weeks.
  2. When there is a problem with the server , like faulty disk, a physical access is required.
  3. Designing a datacenter with High availability is hard.
  4. Designing a datacenter with disaster recovery is harder. image you have  PB of data. you need manages an offsite storage and dedicated private network simply to keep a copy of data nobody will use unless there is s disaster like earth queue.
  5. Estimating the future traffic is a bitch. You have to design for peek, and this means idle resource and lot of waste in non rush hours.

Use datacenter  for big data when:

  1. You already have a datacenter in the organization and a dedicated OPS team with fast turn around.
  2. When you need a highly custom hardware.
  3. You have a country regulation that forbids you from taking the data out of the country.
  4. When you are planning to build your own private cloud (b/c you have have too, self service is a MUST these days)

Use Cloud for Big Data when:

  1. You are not sure about the project success rate.
  2. You are getting started with small data and unsure about the data growth rate
  3. When time to market is company wide priority.
  4. When datacenter is not part of your company business.
  5. When most of the time of day you are idle, and you have rush hours for a short period. Auto Scaling in the cloud can reduce your costs.
  6. By the way… each year about 5% of costs is decreased. basically over time, simply by upgrading to the newest and latest machines, with same amount of resources your get more for less.

Which Cloud vendor should you choose for Big Data?

Wow…. Talk about the million dollar questions. The truth is that , It really really really… depends on the use case – don’ t hate me. hate the game. I can compare cloud vendor on the technical level, I can compare market share of each of the cloud vendors and so on, but each company, each industry has it own challenges. Each Industry will benefit a different cloud provider.

I would use the following rules of thumbs for selecting a cloud vendor for Big Data applications:

  1. Does this cloud vendor has the managed services (PaaS) you need for your applications.
    1. OLTP databases
    2. DW solutions
    3. Big Data Solution such as messaging, streaming
    4. Storage – yes, managed storage solution has many features that are useful for you, such as cross region replication, and cross account/project bucket sharing.
  2. Do you need and Hadoop distribution? if so do a technical deep dive of the feature list and flexibility in the managed service of this specific cloud vendor such
    1. list of open source technologies supported
    2. list of web client such as Hue/Zeppelin etc.
    3. Auto Scaling feature?
    4. Master Node availability ?
    5. Transient cluster bootstrapping?
  3. Get a strong sense of how good the support team of the specific Big data managed services you are going to use.

What is ACID ?

  • Atomic – all or nothing. Either query completed or not. (transaction)
  • Consistency – once transaction committed – that must conform with with the constraints of the schema.
    • Requires
      • a well defined schema (anti pattern to NoSQL). Great for READ intensive and  AD HOC queries.
      • Locking of data until transaction is finished. → performance bottleneck.
    • Consistency uses cases
      • Strongly consistent” : bank transactions must be ACCURATE
      • Eventually consistent”: social media feed updates. (not all users must see same results at the same time.
  • Isolations – concurrent queries – must run in isolation, this ensuring same result.
  • Durable – once transaction is executed, data will preserved.  E.g. power failure

You should know that not every databases is ACID. but not always it is important for you.

What is NoSQL database?

A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

NoSQL antipattern – AD HOC querying.

Advantages of NoSQL databases?

  1. Non-relational and schema-less data model
  2. Low latency and high performance
  3. Highly scalable
  4. SQL queries are not well suited for the object oriented data structures that are used in most applications now.
  5. Some application operations require multiple and/or very complex queries. In that case, data mapping and query generation complexity raises too much and becomes difficult to maintain on the application side.

What are the different NoSQL Database types ?

  • Key value store.
  • Document store – e.g json.  document stores are used for aggregate objects that have no shared complex data between them and to quickly search or filter by some object properties.
  • column family. These are used for organizing data based on individual columns where actual data is used as a key to refer to whole data collection. This particular approach is used for very large scalable databases to greatly reduce time for searching data
  • Graph databases map more directly to object oriented programming models and are faster for highly associative data sets and graph queries. Furthermore they typically support ACID transaction properties in the same way as most RDBMS.

When do you normalize tables?

  1. eliminate redundant data, utilize space efficiently
  2. reduce update errors.
  3. in situations where we are storing immutable data such as financial transactions or a particular day’s price list.

When should you NOT normalize tables? denormalize use flat tables?

  1. Could you denormalize your schema to create flatter data structures that are easier to scale?
  2. When Multiple Joins are Needed to Produce a Commonly Accessed View
  3. decided to go with database partitioning/sharding then you will likely end up resorting to denormalization
  4. Normalization saves space, but space is cheap!
  5. Normalization simplifies updates, but reads are more common!
  6. Performance, performance, Performance

What are the common NoSQL use cases?

  1. Application requiring horizontal scaling : Entity-attribute-value = fetch by distinct value
    1. Mobile APP with millions of users – for each user: READ/WRITE
    2. The equivalent in OLTP is sharding – VERY COMPLEX.
  2. User Preferences, fetch by distinct value
    1. (structure may change dynamically)
    2. Per user do something.
  3. Low latency / session state
    1. Ad tech , and large scale web servers- capturing cookie state
    2. Gaming – game state
  4. Thousand of requests per seconds (WRITE only)
    1. Reality – voting
    2. Application monitoring and IoT- logs and json.

What is Hadoop?

The Extremely brief and demystified answer: Hadoop is an ecosystem of open source solutions designed to solve a variety of big data problems.

There are several layers in Hadoop cluster. Compute, Resource management, and Storage.

  1. Compute Layer which contains applications such as Spark and Hive. they run on top of an hadoop cluster .
  2. Resources management solutions, such as Yarn, that help distribute the workflows on the clusters nodes.
  3. Storage solutions such Hadoop File system, Kudu. Designed ro keep the your big data.

What is the difference between Row Based databases and Columnar databases?

Example: Mysql is row based. Redshift is columnar.

for an operational database (production database) you would use a row based database as, the use case is simple: given a user id – give me his attributes.

For Analytics – use a columnar database, as the use case is different – you want to country how many user have visited your website yesterday from a specific country

Using the correct database for your use case has a tremendous impact on performance. I.e when you need a row from the data, – a row based date makes sense. when you need only a column, a row based database will not make sense.

Furthermore, columnar database usually have some analytical SQL syntax and functions like window function, which are meaning less for operational databases.


I put a lot of thoughts into these blogs, so I could share the information in a clear and useful way. If you have any comments, thoughts, questions, or you need someone to consult with, feel free to contact me:

1 thought on “Big Data Jargon | FAQ’s and everything you wanted to know and didn’t ask about Big Data …”

Leave a Reply