Adding Purification Tags to Expression Vectors: His, GST, and SUMO Strategies
Adding a purification tag to a protein expression vector is rarely the bottleneck people assume it will be. The actual bottleneck is choosing the right tag for your protein and then putting it in a place that doesn’t silently sabotage expression, folding, or activity. Most online guides describe how to clone in a 6×His tag and stop there. This post covers the three tags you actually choose between in a working lab — His, GST, and SUMO — and how to design each one without breaking your construct.
This post is written for someone who has done at least a few rounds of cloning and now needs to add a tag to a real construct that will be used for protein purification, pull-down, or downstream assays.
How to choose between His, GST, and SUMO
The three tags solve different problems, and putting the wrong one on your protein can mean weeks of confused troubleshooting. Choose by what you need from the tag, not by what is already in the lab’s plasmid drawer.
| Tag | Size | Mechanism | Best for | Avoid when |
|---|---|---|---|---|
| 6×His (or 8×/10×His) | ~0.8–1.4 kDa | Binds Ni-NTA or Co-NTA via histidine coordination | Routine purification; small tag has minimal impact on folding/activity | Protein has surface-exposed histidine clusters that bind Ni nonspecifically |
| GST (glutathione S-transferase) | ~26 kDa | Binds glutathione resin; elutes with reduced glutathione | Insoluble proteins (acts as a solubility tag); pull-down assays | Target is cytotoxic at high expression; GST dimerizes and can force artificial dimerization of your protein |
| SUMO | ~12 kDa | No affinity resin of its own, so usually paired with His; removed by SUMO protease (Ulp1/SENP), which cleaves at the SUMO/target junction and leaves a native N-terminus | Proteins that need a perfectly native N-terminal sequence; difficult-to-fold proteins (SUMO often improves solubility and folding) | You need C-terminal tagging, because SUMO only works N-terminally |
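The His-tag size range in the table can be sanity-checked with simple arithmetic. The sketch below assumes the standard average residue mass of histidine (about 137.14 Da) and treats the tag as a free peptide; `poly_his_mass_kda` is a hypothetical helper name, not part of any library.

```python
# Average masses in daltons (assumed textbook values).
HIS_RESIDUE_DA = 137.14  # histidine residue within a peptide chain
WATER_DA = 18.02         # one water added for the free peptide termini


def poly_his_mass_kda(n_his: int) -> float:
    """Approximate average mass of a free His_n peptide, in kDa."""
    return (n_his * HIS_RESIDUE_DA + WATER_DA) / 1000.0


for n in (6, 8, 10):
    print(f"{n}xHis: {poly_his_mass_kda(n):.2f} kDa")
```

This gives roughly 0.84 kDa for 6×His and 1.39 kDa for 10×His, consistent with the ~0.8–1.4 kDa range quoted in the table. Linker and cleavage-site residues add to this in a real construct.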
A common pattern in E. coli expression: try a 6×His tag first. If the protein is insoluble or expression is poor, switch to a His-SUMO fusion — this gives you Ni-NTA capture plus SUMO’s solubility benefits, and the SUMO protease leaves a native N-terminus after cleavage. GST is a third-line choice for proteins that fail both, accepting that the 26 kDa fusion partner introduces its own complications.
Where to place the tag: N-terminal vs C-terminal
Tag position matters more than tag identity for many proteins. The rules:
- N-terminal tagging is the default. The ribosome translates from N to C, so an N-terminal tag is in place before the rest of the protein folds. Protease cleavage sites (TEV, PreScission, SUMO protease) work cleanly here, because few or no residues are left on the protein after cleavage.
- C-terminal tagging is required when the N-terminus is functionally important — a signal peptide that must be cleaved, a membrane anchor, or a protein where the native N-terminal residue is required for activity. C-terminal tags are less reliable for solubility rescue.
- Don’t put a tag on both ends unless you have a specific reason — you double the chance of folding interference, and dual-tag (tandem affinity) purification requires careful elution strategy.
For a protein with an unknown structure, the safest choice is an N-terminal His-SUMO fusion: you purify on Ni-NTA, then cleave with SUMO protease for a native N-terminus. If you also want a TEV option, place the TEV site between SUMO and the protein, but note the trade-off: with the TEV site present, SUMO protease leaves the TEV-site residues on your protein rather than a native N-terminus, while TEV cleavage leaves a single residue (G).
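To see what TEV cleavage actually leaves behind, it can help to simulate it on the protein sequence. This is a minimal sketch that assumes the canonical ENLYFQ/G (or ENLYFQ/S) site and only finds the first occurrence; `tev_cleave` and the target sequence `KTAYIAK` are made up for illustration.

```python
import re

# Canonical TEV site: ENLYFQ followed by G or S. Cleavage occurs after the Q.
TEV_SITE = re.compile(r"ENLYFQ(?=[GS])")


def tev_cleave(fusion: str) -> tuple[str, str]:
    """Split a fusion protein at its first TEV site.

    Returns (tag_fragment, released_protein).
    """
    match = TEV_SITE.search(fusion)
    if match is None:
        raise ValueError("no TEV site found")
    cut = match.end()  # index just after the Q
    return fusion[:cut], fusion[cut:]


# His tag + GS linker + TEV site + hypothetical target protein
fusion = "MHHHHHHGS" + "ENLYFQG" + "KTAYIAK"
tag_fragment, released = tev_cleave(fusion)
print(tag_fragment)  # MHHHHHHGSENLYFQ
print(released)      # GKTAYIAK
```

The released protein starts with the G from the TEV site, which is the single extra residue mentioned above.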
Designing the cassette: linkers, cleavage sites, and reading frame
This is where most cloning errors happen. The cassette must be in-frame with the start codon, in-frame through the entire fusion, and have correct stop placement.
A typical N-terminal 6×His + TEV cleavage cassette looks like:
ATG CAT CAC CAT CAC CAT CAC GGA TCC GAA AAC CTG TAT TTT CAG GGA [protein ORF] TAA
M   H   H   H   H   H   H   G   S   E   N   L   Y   F   Q  /G   [protein]     *
Read this carefully:
- Six histidines (CAT/CAC codons — both encode His and are favored in E. coli)
- A short linker (GS) for flexibility
- The TEV protease site ENLYFQ/G — TEV cleaves between Q and G, leaving a single G on your protein’s N-terminus
- Your protein’s ORF without its own start codon (the ATG belongs to the tag)
- A single stop codon (TAA is most common in E. coli)
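The cassette above can be checked by translating it codon by codon. The sketch below builds the standard genetic code table from scratch so it has no dependencies; the `translate` helper and the three-codon stand-in ORF (`AAA ACC GCG`) are illustrative assumptions, not a real protein.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate an in-frame DNA string; spaces are ignored."""
    dna = dna.replace(" ", "").upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


cassette = "ATG CAT CAC CAT CAC CAT CAC GGA TCC GAA AAC CTG TAT TTT CAG GGA"
orf_without_atg = "AAA ACC GCG"  # hypothetical stand-in for the protein ORF

print(translate(cassette))                            # MHHHHHHGSENLYFQG
print(translate(cassette + orf_without_atg + "TAA"))  # MHHHHHHGSENLYFQGKTA*
```

If the translation of your real fusion shows anything other than the expected tag, linker, and cleavage site followed by your protein and a single terminal `*`, the frame is off somewhere.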
For C-terminal tags, the inverse rule applies: do not include the protein’s stop codon. If you do, the ribosome stops before reaching the tag, and you get untagged protein. Many cloning failures trace to this single oversight.
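The stop-codon rule for C-terminal tags is easy to automate. This is a sketch under simple assumptions: the ORF is already in frame, a GS linker (GGA TCC) sits between protein and tag, and `add_c_terminal_his` is a hypothetical helper name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def add_c_terminal_his(orf: str, n_his: int = 6) -> str:
    """Append a GS linker, an n-His tag, and one stop codon to an ORF.

    The ORF's own stop codon, if present, is removed first so that
    translation continues into the tag.
    """
    orf = orf.replace(" ", "").upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length is not a multiple of 3")
    if orf[-3:] in STOP_CODONS:
        orf = orf[:-3]  # drop the native stop codon
    linker = "GGATCC"  # Gly-Ser
    his_tag = "".join("CAT" if i % 2 == 0 else "CAC" for i in range(n_his))
    return orf + linker + his_tag + "TAA"


# Hypothetical 4-codon ORF with its native stop still attached
fused = add_c_terminal_his("ATG AAA ACC GCG TAA")
print(fused)
```

The result reads ATG AAA ACC GCG, then the linker and six His codons, then a single TAA. The native stop has been removed rather than left in front of the tag.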
Codon usage and host compatibility
The tag itself rarely needs codon optimization: standard His and SUMO sequences in commercial vectors (pET, pGEX, pET-SUMO) are already tuned for E. coli. The protein you’re fusing is where the codon problem lives. If your protein has a low codon adaptation index for E. coli, the tag will express fine but the downstream protein won’t. Run the full fusion through a codon optimizer before ordering synthesis, treating the tag-protein junction as one continuous sequence, not two pieces glued together. The same principles apply as for any codon optimization for E. coli expression.
One subtle case: GST has a known rare codon at position 17 in its native sequence. Most modern pGEX vectors have already corrected this, but if you’re using a custom GST sequence (e.g., from an old academic plasmid), check the codons before assuming it will express. Rare codons in the tag itself cause expression failure that looks like a downstream problem.
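A quick scan for rare codons across the whole fusion can flag problems before you order anything. The codon list below is an assumption: it is a commonly cited shortlist of rare E. coli codons, not an authoritative table, and you should substitute the usage table for your own host strain. `rare_codon_hits` is a hypothetical helper.

```python
# ASSUMPTION: commonly cited rare codons in E. coli; replace with your own table.
RARE_CODONS = {"AGA", "AGG", "ATA", "CTA", "CGA", "CCC"}


def rare_codon_hits(cds: str) -> list[tuple[int, str]]:
    """Return (1-based codon position, codon) for each rare codon in frame."""
    cds = cds.replace(" ", "").upper()
    usable = len(cds) - len(cds) % 3  # ignore any trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [(pos, c) for pos, c in enumerate(codons, start=1) if c in RARE_CODONS]


# His tag + GS linker, followed by a made-up ORF containing rare codons
fusion = "ATG CAT CAC CAT CAC CAT CAC GGA TCC" + " AGA AGG CTG ATA AAA TAA"
print(rare_codon_hits(fusion))  # [(10, 'AGA'), (11, 'AGG'), (13, 'ATA')]
```

Scanning the full fusion, rather than the protein alone, also catches rare codons that sit in the tag or linker.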
Cloning strategy: how to actually get the tag in
Three patterns dominate, ranked by how robust they are when something goes wrong:
- Start from a tagged backbone. Vectors like pET-28a (His), pGEX-4T (GST), and pET-SUMO already have the tag in place upstream of an MCS. Clone your insert into the MCS in-frame and you’re done. This is the fastest path, and Addgene has hundreds of pre-validated tagged backbones.
- PCR-add the tag during amplification. Design forward primers that include the tag sequence as a 5′ overhang. This works for short tags (His, FLAG, HA) but becomes unwieldy past ~30 codons. The 6×His + GS + TEV cassette shown earlier is a 48 nt overhang, so the full primer comes to roughly 65–70 nt once you add the annealing region, which most oligo vendors can synthesize.
- Gibson or Golden Gate to insert the tag separately. Useful when retrofitting a tag into an existing construct that doesn’t have one. Order the tag as a gBlock with overlaps to your vector linearization sites.
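For the PCR-add route, the forward primer is the tag cassette followed by the first bases of your ORF. The sketch below assumes a fixed 20 nt annealing region and ignores Tm, secondary structure, and any upstream restriction site or assembly overlap you would add in practice; `forward_primer` and the example ORF are hypothetical.

```python
# 6xHis + GS linker + TEV site, as in the cassette described earlier (48 nt).
TAG_OVERHANG = (
    "ATG CAT CAC CAT CAC CAT CAC GGA TCC GAA AAC CTG TAT TTT CAG GGA"
).replace(" ", "")


def forward_primer(orf: str, anneal_len: int = 20) -> str:
    """Build a forward primer: tag overhang + start of the ORF.

    The ORF's own ATG is dropped because the tag supplies the start codon.
    """
    orf = orf.replace(" ", "").upper()
    if orf.startswith("ATG"):
        orf = orf[3:]
    if len(orf) < anneal_len:
        raise ValueError("ORF is shorter than the annealing region")
    return TAG_OVERHANG + orf[:anneal_len]


example_orf = "ATGAAAACCGCGTATATTGCGAAACAGCGTCAGATT"  # made-up sequence
primer = forward_primer(example_orf)
print(len(TAG_OVERHANG), len(primer))  # 48 68
```

A 68 nt primer is long but orderable. Past that length, the tagged-backbone or gBlock routes tend to be less error-prone.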
For verification after cloning, a diagnostic digest with an enzyme that cuts inside the tag region confirms the tag is present and in the right orientation. Check double-digest buffer compatibility before going to the bench. Tagged constructs often carry a unique site at or next to the tag (e.g., NdeI in many pET vectors), and that site is your fastest verification handle.
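Finding a candidate diagnostic site is a plain string search. The sketch below looks for two well-known palindromic recognition sequences (BamHI GGATCC and NdeI CATATG), so scanning one strand is enough. It runs on the tag cassette only, whereas a real check must cover the whole plasmid to confirm the site is unique. `find_sites` is a hypothetical helper.

```python
# Palindromic recognition sequences, so one strand is sufficient.
SITES = {"BamHI": "GGATCC", "NdeI": "CATATG"}


def find_sites(seq: str, sites: dict[str, str] = SITES) -> dict[str, list[int]]:
    """Return 1-based start positions of each recognition site in seq."""
    seq = seq.replace(" ", "").upper()
    found: dict[str, list[int]] = {}
    for name, motif in sites.items():
        positions = []
        start = seq.find(motif)
        while start != -1:
            positions.append(start + 1)
            start = seq.find(motif, start + 1)  # allow overlapping hits
        found[name] = positions
    return found


cassette = "ATG CAT CAC CAT CAC CAT CAC GGA TCC GAA AAC CTG TAT TTT CAG GGA"
print(find_sites(cassette))  # {'BamHI': [22], 'NdeI': []}
```

Here the GS linker codons (GGA TCC) happen to form a BamHI site, which is one way a tag cassette can introduce its own verification handle.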
What can go wrong, and how to spot it early
Most tag-related failures show up in one of three ways:
- Protein expresses but doesn’t bind the resin. Reading frame is off, or the tag is buried in a folding pocket. Sequence-verify the construct first — it’s almost always a frameshift at the tag-protein junction. If the sequence is correct, try the other terminus.
- Protein binds resin but elutes prematurely or never elutes. For His tags, check the imidazole concentration in your wash and elution buffers; you may also have a contaminating histidine-rich protein competing for the resin. For GST, check pH and reductant in the elution buffer. For SUMO fusions that will not cleave, check that the protease matches the SUMO variant (Ulp1 for yeast SUMO, SENP for human SUMO); they are not reliably interchangeable.
- Cleavage doesn’t go to completion. The cleavage site is partially buried. SUMO and TEV proteases need accessible sites; if your protein’s N-terminus folds back over the cleavage site, you get partial cleavage. The fix is usually to add a longer linker (3–5 residues) between the cleavage site and the protein, accepting that those linker residues stay on the protein after cleavage.
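Since junction frameshifts are the most common culprit, a basic in-silico frame check on the designed sequence (or on the assembled sequencing read) is worth running before troubleshooting at the bench. This is a minimal sketch; `frame_problems` and the short stand-in ORF are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frame_problems(cds: str) -> list[str]:
    """List basic reading-frame problems in a coding sequence."""
    cds = cds.replace(" ", "").upper()
    problems = []
    if not cds.startswith("ATG"):
        problems.append("no ATG start")
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Complete codons only; a trailing partial codon is ignored here.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    internal = [i + 1 for i, c in enumerate(codons[:-1]) if c in STOP_CODONS]
    if internal:
        problems.append(f"premature stop at codon(s) {internal}")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    return problems


tag = "ATG CAT CAC CAT CAC CAT CAC GGA TCC GAA AAC CTG TAT TTT CAG GGA"
good = tag + "AAA ACC GCG TAA"      # in-frame fusion
bad = tag + "A" + "AAA ACC GCG TAA"  # one extra base at the junction

print(frame_problems(good))  # []
print(frame_problems(bad))
```

The one-base insertion in `bad` shifts everything downstream of the tag, so the check reports a length that is not a multiple of 3 and a missing terminal stop codon.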
None of these are catastrophic, but each one costs a week. Building the construct with the right tag, in the right place, with the right linker, is the cheapest insurance against all of them.
Quick-reference design checklist
- Tag chosen based on goal (purification, solubility, native N-terminus) — not lab convenience
- Position chosen based on protein biology (N-terminal default; C-terminal only when needed)
- Reading frame verified through the entire fusion — tag, linker, cleavage site, protein
- Native start codon removed (N-terminal tag) or stop codon removed (C-terminal tag)
- Codon optimization applied to the full fusion, not just the protein
- Cleavage site choice matches available proteases in the lab (TEV, SUMO, PreScission)
- Diagnostic restriction site identified for post-clone verification
- For complex fusions, ordered as a single gBlock from a synthesis vendor — not assembled from primers
If you’re designing a new bacterial expression construct from scratch, the six decisions framework for bacterial expression covers the upstream choices that determine whether your tag will even get translated. The tag is the last design decision, not the first.