If you have worked with DTS or SSIS packages you probably know that they can quickly become painful to work with, especially when it comes to versioning, logging and, above all, deployment.

I recently tried Ayende's open source ETL framework RhinoETL. It tackles the ETL problem from a completely different angle than DTS/SSIS. At its heart, RhinoETL is a very simple .NET framework for handling an ETL process; the key components are processes, pipelines and operations.
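
To give a feel for the model: a process is just a class that registers its operations in order, and the rows flow through those operations as a pipeline. Here is a minimal, self-contained sketch, assuming Rhino.Etl.Core is referenced; the class names are placeholders and not part of my actual process:

using System.Collections.Generic;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Operations;

// A do-nothing operation, just to show the shape: an operation works on the
// stream of rows that flows through the pipeline and yields rows onward.
public class PassThroughOperation : AbstractOperation
{
    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        foreach (var row in rows)
            yield return row;
    }
}

// A process simply registers its operations; rows stream from one operation
// to the next in the order they are registered.
public class SampleProcess : EtlProcess
{
    protected override void Initialize()
    {
        Register(new PassThroughOperation());
        Register(new PassThroughOperation());
    }
}

// Running the process:
// new SampleProcess().Execute();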

The process that I needed was a very simple one, namely to update a text table stored in multiple databases. The updates could be of different kinds, for example swapping every occurrence of a text translation for another, or deleting or updating a specific row. In previous releases this was handled by writing manual update scripts. The release that I am currently working on, however, requires extensive changes to the texts in these tables spread over many databases, and writing repetitive SQL scripts was not something I felt like doing. It felt like a good opportunity to try RhinoETL.

I began by writing this input operation:

public class ReadWordList : InputCommandOperation
{
    public ReadWordList(string connectionStringName) 
      : base(connectionStringName) {  }

    protected override Row CreateRowFromReader(IDataReader reader)
    {
        return Row.FromReader(reader);                
    }

    protected override void PrepareCommand(IDbCommand cmd)
    {
        cmd.CommandText = "SELECT * FROM Wordlists";
    }
}

This is the first operation; its responsibility is to fill the pipeline with rows from the Wordlists table. The next operation is the one that updates the rows. It takes as input a list of ITextChange instances, the objects that perform the actual changes.

public class TextChangeOperation : AbstractOperation
{
    private IList<ITextChange> changes;
  
    public TextChangeOperation(IList<ITextChange> changes)
    {
        this.changes = changes;
    }

    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        foreach (var row in rows)
        {
            foreach (var change in changes)
            {
                if (change.IsValidFor(row))
                    change.Perform(row);
            }

            yield return row;
        }
    }
}

There is a class hierarchy representing the different types of text changes:

 

(Class diagram: the ITextChange interface, implemented by GeneralTextSwap, UpdateTextRow and DeleteTextRow.)
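
The interface itself is tiny: every change only needs to answer whether it applies to a given row, and apply itself to that row. Reconstructed from how the changes are used in the code above and below:

using Rhino.Etl.Core;

public interface ITextChange
{
    // Does this change apply to the given row?
    bool IsValidFor(Row row);

    // Apply the change by mutating the row in place.
    void Perform(Row row);
}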

The code for the GeneralTextSwap class is very simple:

public class GeneralTextSwap : ITextChange
{
    public string TextOld { get; set; }
    public string TextNew { get; set; }

    public bool IsValidFor(Row row)
    {
        var text = ((string) row["Text"]).Trim();
        return text == TextOld;
    }

    public void Perform(Row row)
    {
        row["TextOld"] = row["Text"];
        row["Text"] = TextNew;
    }
}
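
UpdateTextRow and DeleteTextRow follow the same pattern but target a single named row. As a rough sketch only: the column names and matching logic below are simplified assumptions based on how the classes are used further down, not the exact implementation.

using Rhino.Etl.Core;

// Simplified sketch of UpdateTextRow. The WorldList and Name columns are
// assumed to identify a single row in the Wordlists table; DbName is used
// when deciding which database the change should be run against (not shown here).
public class UpdateTextRow : ITextChange
{
    public string DbName { get; set; }
    public string WorldList { get; set; }
    public string Name { get; set; }
    public string TextNew { get; set; }

    public bool IsValidFor(Row row)
    {
        return Equals(row["WorldList"], WorldList)
            && Equals(row["Name"], Name);
    }

    public void Perform(Row row)
    {
        row["TextOld"] = row["Text"];
        row["Text"] = TextNew;
    }
}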

The point of storing the old text value is that I have an operation after the TextChangeOperation that logs all changes to a CSV file (a rough sketch of such an operation is included at the end of the post). The changes for a specific release are then just defined as a static list on a static class. For example:

public class ReleaseChanges_For_Jan09
{
  public static IList<ITextChange> List;

  static ReleaseChanges_For_Jan09()
  {
    List = new List<ITextChange>()
    {            
      new GeneralTextSwap() 
      {
         TextOld = "some old link value",
         TextNew = "some new link value"
      },
      new UpdateTextRow()
      {
          DbName = "DN_NAME",
          WorldList = "USER_PAGE",
          Name = "CROSS_APP_LINK",
          TextNew = "New link value"
      },
      new DeleteTextRow()
      {
          DbName = "DN_NAME_2",
          WorldList = "SOME_PAGE",
          Name = "SOME_LINK"
      }
    };
  }
}

The above is just an example; in reality I have hundreds of general and specific changes, and yes, this is a legacy system that handles text and links very strangely. The changes above could easily be handled by a simple SQL script with about the same number of lines of T-SQL, but the benefit of placing them inside an ETL process written in C# is that I can easily reuse the change logic over multiple databases, and I also get great logging and tracing of all changes.

The point of this post is to show how easy it can be to use OOP to model and abstract an ETL process using RhinoETL. Even though this ETL process is very simple, it serves as a good example of how a T-SQL script or a DTS/SSIS package can be rewritten and simplified using the power of an object oriented language.
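
For completeness: the CSV logging operation mentioned earlier and the process that ties the operations together are not shown above. Here is a rough sketch of what they can look like; the class names, the semicolon-separated file format, the placeholder connection string name and the use of the TextOld column as a "this row was changed" marker are conventions of this sketch, not something RhinoETL prescribes (RhinoETL also has built-in file support, based on FileHelpers, that could be used instead of a raw StreamWriter).

using System.Collections.Generic;
using System.IO;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Operations;

// Sketch of the CSV logging step. Rows that were touched by a change have
// their TextOld column set, so only those rows are written to the log file.
public class LogChangesToCsv : AbstractOperation
{
    private readonly string path;

    public LogChangesToCsv(string path)
    {
        this.path = path;
    }

    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        using (var writer = new StreamWriter(path, true)) // true = append
        {
            foreach (var row in rows)
            {
                if (row["TextOld"] != null)
                    writer.WriteLine("{0};{1};{2}", row["Name"], row["TextOld"], row["Text"]);

                yield return row;
            }
        }
    }
}

// Sketch of the process wiring; "MyDatabase" is a placeholder for a
// connection string name in the application config. The same process can be
// executed once per database to apply the changes everywhere.
public class TextChangeProcess : EtlProcess
{
    protected override void Initialize()
    {
        Register(new ReadWordList("MyDatabase"));
        Register(new TextChangeOperation(ReleaseChanges_For_Jan09.List));
        Register(new LogChangesToCsv("changes_MyDatabase.csv"));
        // ...followed by an operation that writes the updated rows back to the
        // database, which is outside the scope of this sketch.
    }
}

// new TextChangeProcess().Execute();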

7 comments:

Adam Tybor said...

I am interested to know what your performance is like with rhino-etl and how much data you are moving?

I have used rhino-etl in the past for similar things and it's so much better than SSIS. However, the engine in SSIS blows away the Rhino engine when it comes to high performance ETLs that need to run on regular intervals.

If complexity is high and performance is not an issue I use rhino, otherwise I am stuck with SSIS :(

Torkel Ödegaard said...

The performance is more than sufficient for my scenarios, as I am only processing small tables with about 1000-2000 rows.

Have you mentioned your performance problems on the rhino-tools mailing list? Maybe there is something that can be done about them.

Pete said...

Does RhinoETL provide a feature to insert into identity columns? For example, pre-populating a fresh database with dictionary/test data?

Jon said...

I will preface this comment by saying that I haven't tried Rhino yet.

However, being an experienced ETL guy (Informatica, Datastage, DTS/SSIS), I certainly understand SSIS's many limitations and frequently am frustrated by them.


That said, what SSIS IS good at is allowing the developer to build complex ETLs without much coding.

The thought of having to write/maintain even as much code as this seems like a step in the wrong direction from a productivity standpoint, unless the benefits in flexibility and maintenance (not to mention logging, deployment and versioning) are great.

Am I missing something here?

Torkel Ödegaard said...

There isn't that much code, and it is easily extensible and maintainable, works great with source control, is easy to deploy, etc.

Anonymous said...

With a little bit of code things can become much more maintainable. We started using SSIS and found it very difficult to deal with unstructured data. We invested some time in creating custom task, source, transform and destination adapters. SSIS development is now very flexible and fast for us. And if there is something that we do all the time in a package, we create a custom component and get reuse that way. In my experience you should combine ETL + coding to deliver the best solution.

Anonymous said...

Can you help me implement this in object oriented C++, please? The question is: for each semester, a student needs to register for the subjects they want to take that semester. Each student is allowed to register for up to 15 credits. Each subject consists of 3 credits. During registration, the system will check the student record to find out whether the student has any debts with the college or not. If the student doesn't have any debts, then registration is allowed.